Me
Enxin Song (宋恩欣)
Master's Student @ ZJU

About

Enxin Song is a research intern at the University of California, San Diego (UCSD), working with Prof. Zhuowen Tu. She will receive her M.S. from Zhejiang University in March 2026, advised by Prof. Gaoang Wang (CVNext Lab), and holds a B.S. in Software Engineering from Dalian University of Technology. Previously, she was a research intern at Microsoft Research Asia. Enxin stays curious about the wider landscape of computer vision and deep learning and actively seeks new collaboration opportunities. Her work centers on video understanding, highlighted by MovieChat, the first large multimodal model for hour-long video understanding. She has co-organized workshops and challenges on video understanding at CVPR 2024 and 2025. A highly self-motivated and curious student, she is applying to Ph.D. programs for Fall 2026. You can view her Curriculum Vitae and undergraduate transcript.

Experiences

University of California, San Diego (UCSD), USA

Visiting Intern

Advised by Prof. Zhuowen Tu.

Media Computing Group, Microsoft Research Lab - Asia, Beijing

Research Intern

Worked on text-to-image generation.


Education

M.S., Zhejiang University, Hangzhou, China

Artificial Intelligence

Ranked 1/82 in the M.S. program.

B.S., Dalian University of Technology (DLUT), Dalian, China

Software Engineering

Ranked 21/385 in the undergraduate cohort.


News

  • Oct 2025 We release VideoNSA, a hardware-aware native sparse attention mechanism for video understanding.
  • Sep 2025 Invited talk at Lambda AI, "From Seeing to Thinking".
  • Sep 2025 One paper accepted by ICCV 2025 KnowledgeMR Workshop.
  • Aug 2025 Our paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering was accepted by IEEE TPAMI.
  • Jul 2025 Our paper Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark was accepted by ICCV 2025 Findings.

Selected Publications and Manuscripts

* Equal contribution.

Also see Google Scholar.

VideoNSA: Native Sparse Attention Scales Video Understanding
Preprint, 2025
VideoNSA delivers hardware-aware native sparse attention primitives for efficient video understanding systems.
Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
ICCVW, 2025
Video-MMLU is a massive benchmark designed to evaluate the capabilities of LMMs in understanding multi-discipline lectures.
AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
ICCV, 2025
AuroraLong uses a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states to solve long video understanding tasks.
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
ICLR, 2025
AuroraCap is a multimodal LLM designed for detailed image and video captioning. We also release VDC, the first benchmark for detailed video captioning.
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
CVPR, 2024
MovieChat achieves state-of-the-art performance in extra-long video understanding (more than 10K frames) by introducing a memory mechanism.

Professional Service

  • Conference and Journal Refereeing:
         NeurIPS 2025; PRCV 2023, 2025; CVPR 2025; ICLR 2025, 2026; TMM 2024; TPAMI 2025
  • Workshop Organization:
         Workshop on Long-form Video Understanding at CVPR 2025
         Workshop on Long-form Video Understanding at CVPR 2024
  • Teaching Assistant:
         ECE 445 Senior Design (Undergraduate), Spring 2024, with Prof. Gaoang Wang

Selected Honors & Awards

  • Lambda AI Cloud Credits Grant Sponsorship, 2025
  • National Scholarship, 2025 (Zhejiang University)
  • National Scholarship, 2024 (Zhejiang University)
  • National Scholarship, 2021 (Dalian University of Technology)