About

Enxin Song is a research intern at the University of California, San Diego (UCSD) under Prof. Zhuowen Tu. She will receive her M.S. in March 2026 from Zhejiang University, advised by Prof. Gaoang Wang (CVNext Lab), and holds a B.S. in Software Engineering from Dalian University of Technology. Previously, she was a research intern at Microsoft Research Asia. Enxin stays curious about the wider landscape of computer vision and deep learning, and actively seeks new collaboration opportunities. Her work centers on video understanding, highlighted by MovieChat, the first Large Mutli-Modal Model for hour-long video understanding. She has co-organized workshops and challenges on video understanding at CVPR 2024 and 2025. She is a highly self-motivated, and curious student applying to Ph.D. programs for 2026Fall. You can view her Curriculum Vitae and undergraduate transcript.

Experiences

Feb. 2025 – Present

University of California, San Diego (UCSD), USA

Visiting Intern

Advised by Prof. Zhuowen Tu.

Nov. 2023 – May 2024

Media Computing Group, Microsoft Research Lab - Asia, Beijing

Research Intern

Worked on text-to-image generation.

Education

Sep. 2023 – Mar. 2026 (expected)

M.S., Zhejiang University, Hangzhou, China

Artificial Intelligence

Ranked 1/82 in the M.S. program.

Sep. 2019 – Jun. 2023

B.S., Dalian University of Technology (DLUT), Dalian, China

Software Engineering

Ranked 21/385 in the undergraduate cohort.

News

Oct 2025 Video-MMLU received the Outstanding Paper Award at the ICCV 2025 Knowledge-Intensive Multimodal Reasoning Workshop, along with a travel grant.
Oct 2025 We release VideoNSA, a hardware-aware native sparse attention mechanism for video understanding.
Sep 2025 Invited talk at Lambda AI titled From Seeing to Thinking.
Sep 2025 One paper accepted by ICCV 2025 KnowledgeMR Workshop.
Aug 2025 Our paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering is accepted by IEEE TPAMI.
Jul 2025 Our paper Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark is accepted by ICCV 2025 Findings.

Selected Publications and Manuscripts

* Equal contribution.

Also see Google Scholar.

VideoNSA: Native Sparse Attention Scales Video Understanding

Enxin Song*, Wenhao Chai, Shusheng Yang, Ethan J. Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu

Preprint, 2025

VideoNSA delivers hardware-aware native sparse attention primitives for efficient video understanding systems.

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark

Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, Gaoang Wang,

ICCVW, 2025

Paper Website Benchmark Code

Video-MMLU is a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures.

AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu*, Enxin Song*, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang,

ICCV, 2025

Paper

Video-MMLU uses a linear RNN language model that handles input sequence of arbitrary length with constant-size hidden states to solve long video understanding tasks.

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai*, Enxin Song*, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, Christopher D. Manning

ICLR, 2025

Paper Website Model Benchmark Code

AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

CVPR, 2024

Paper Website Benchmark Code

MovieChat achieves state-of-the-art performace in extra long video (more than 10K frames) understanding by introducing memory mechanism.

Professional Service

Conference and Journal Refereeing:
NeurIPS 2025, PRCV 2023&2025, CVPR 2025, ICLR 2025&2026, TMM 2024, TPAMI 2025

Workshop Organization:
Workshop on Long-form Video Understanding at CVPR 2025
Workshop on Long-form Video Understanding at CVPR 2024

Teaching Assistant

ECE 445 Senior Design (Undergraduate) - Sping 2024
Teaching Assistant (TA), with Prof. Gaoang Wang

Selected Honors & Awards

Lambda AI Cloud Credits Grant Sponsorship, 2025
National Scholarship, 2025 (Zhejiang University)
National Scholarship, 2024 (Zhejiang University)
National Scholarship, 2021 (Dalian University of Technology)

Top