We have developed VSRM (Video Super-Resolution Mamba), a robust deep learning framework that advances video super-resolution (VSR) by leveraging the efficiency of state-space modeling. This work was presented at the International Conference on Computer Vision (ICCV 2025, Rank A) in Hawaii, USA.
In VSRM, we introduce two core modules, the Spatial-to-Temporal Mamba (S2T-Mamba) and the Temporal-to-Spatial Mamba (T2S-Mamba), which enable bidirectional information flow and adaptive token interaction across frames. With this design, we effectively capture long-range spatiotemporal dependencies while preserving fine-grained details.
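To make the scan-ordering idea concrete, the following minimal PyTorch sketch shows how spatial-first (S2T) and temporal-first (T2S) token orderings over video features can be realized. The class name, the use of nn.GRU as a stand-in for the selective state-space (Mamba) layer, and the normalization/residual layout are illustrative assumptions, not the released VSRM code.

```python
# Minimal sketch of spatial-first (S2T) vs. temporal-first (T2S) scanning.
# nn.GRU is only a stand-in for the selective state-space (Mamba) mixer.
import torch
import torch.nn as nn


class DirectionalScanBlock(nn.Module):
    """Flattens video features into a 1-D token sequence in a chosen order,
    runs a sequence mixer over it, and restores the original layout."""

    def __init__(self, channels: int, spatial_first: bool):
        super().__init__()
        self.spatial_first = spatial_first
        # Stand-in sequence mixer; a real implementation would use a
        # selective state-space (Mamba) scan here.
        self.mixer = nn.GRU(channels, channels, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) video features
        b, t, c, h, w = x.shape
        if self.spatial_first:
            # S2T: tokens ordered pixel-by-pixel within a frame, then frame-by-frame
            seq = x.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)
        else:
            # T2S: tokens ordered frame-by-frame at each pixel, then pixel-by-pixel
            seq = x.permute(0, 3, 4, 1, 2).reshape(b, h * w * t, c)
        out, _ = self.mixer(self.norm(seq))
        seq = seq + out  # residual connection
        if self.spatial_first:
            return seq.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)
        return seq.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)


if __name__ == "__main__":
    feats = torch.randn(1, 5, 32, 16, 16)            # 5 frames of 32-channel features
    s2t = DirectionalScanBlock(32, spatial_first=True)
    t2s = DirectionalScanBlock(32, spatial_first=False)
    print(t2s(s2t(feats)).shape)                     # torch.Size([1, 5, 32, 16, 16])
```

Chaining the two orderings is what lets information flow in both directions: the spatial-first pass propagates detail within each frame outward across time, while the temporal-first pass propagates each pixel's temporal context back into the spatial layout.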
Unlike conventional transformer- or recurrent-based approaches, we employ selective state-space modeling (Mamba) to balance global temporal modeling with localized spatial refinement. Because the selective scan visits each token once and therefore scales linearly with sequence length, whereas self-attention scales quadratically, this design achieves superior video quality with faster inference and lower computational cost, making VSRM well suited to real-time applications.
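The toy NumPy scan below illustrates that linear recurrence; the constant per-channel parameters a, b, and c are a simplification, since a selective SSM makes them input-dependent, and this is not VSRM's implementation.

```python
# Toy illustration of why a state-space scan is cheaper than self-attention:
# the recurrence visits each of the T tokens once (O(T) per channel), whereas
# attention builds a T x T similarity matrix.
import numpy as np


def ssm_scan(x: np.ndarray, a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Diagonal linear state-space recurrence y_t = c*h_t, h_t = a*h_{t-1} + b*x_t.

    x: (T, C) token sequence; a, b, c: (C,) per-channel parameters
    (in a selective SSM such as Mamba these become functions of x_t).
    """
    T, C = x.shape
    h = np.zeros(C)
    y = np.empty_like(x)
    for t in range(T):             # single linear pass over the sequence
        h = a * h + b * x[t]
        y[t] = c * h
    return y


tokens = np.random.randn(4096, 64)        # a long flattened spatiotemporal sequence
out = ssm_scan(tokens, a=np.full(64, 0.9), b=np.ones(64), c=np.ones(64))
print(out.shape)                          # (4096, 64)
```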
Through comprehensive experiments on benchmark datasets, we demonstrate that VSRM achieves PSNR improvements of up to 0.28 dB and SSIM gains of 0.004 over state-of-the-art transformer-based methods, while reducing FLOPs by more than 15% and the parameter count by approximately 20%. Our method therefore improves reconstruction quality and computational efficiency at the same time.
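For reference, the reported PSNR and SSIM figures follow the standard definitions sketched below; the exact evaluation protocol (Y-channel conversion, border cropping, per-dataset averaging) varies across VSR benchmarks and is an assumption not reproduced here.

```python
# Minimal sketch of the reported metrics, using scikit-image's reference
# implementations; not the benchmark-specific evaluation scripts.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def video_psnr_ssim(sr_frames: np.ndarray, hr_frames: np.ndarray) -> tuple[float, float]:
    """Average PSNR (dB) and SSIM over frames; inputs are (T, H, W, 3) in [0, 1]."""
    psnrs, ssims = [], []
    for sr, hr in zip(sr_frames, hr_frames):
        psnrs.append(peak_signal_noise_ratio(hr, sr, data_range=1.0))
        ssims.append(structural_similarity(hr, sr, channel_axis=-1, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))


sr = np.random.rand(5, 64, 64, 3)   # hypothetical super-resolved frames
hr = np.random.rand(5, 64, 64, 3)   # hypothetical ground-truth frames
print(video_psnr_ssim(sr, hr))
```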
Our results highlight the broad potential of VSRM for media streaming, surveillance, defense, and medical imaging, where both detail fidelity and real-time processing are essential. By introducing a novel spatiotemporal interaction mechanism that couples fine spatial detail with temporal coherence, we make real-time high-resolution video restoration substantially more practical.