Enxin Song is a Ph.D. student in Computer and Information Science at the University of Pennsylvania, advised by Prof. Jiatao Gu. She received her master's degree from Zhejiang University and her bachelor's degree from Dalian University of Technology. She has conducted research at New York University with Prof. Saining Xie and at the University of California, San Diego with Prof. Zhuowen Tu. Her research centers on video understanding and generative models, with a focus on efficient long-sequence modeling, generative models for text-to-image synthesis, and benchmarking for video understanding. She has co-organized workshops at CVPR 2024 and 2025.

News

  • Jan 2026 Our VideoNSA is accepted by ICLR 2026.
  • Nov 2025 Selected (top 10%) to give a talk at the KAUST Rising Stars in AI Symposium 2026.
  • Oct 2025 Video-MMLU received the Outstanding Paper Award at the ICCV 2025 Knowledge-Intensive Multimodal Reasoning Workshop, along with a travel grant.
  • Oct 2025 We release VideoNSA, a hardware-aware native sparse attention mechanism for video understanding.
  • Sep 2025 Invited talk at Lambda AI titled From Seeing to Thinking.
  • Sep 2025 One paper accepted by ICCV 2025 KnowledgeMR Workshop.
  • Aug 2025 Our paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering is accepted by IEEE TPAMI.

Education

Ph.D., University of Pennsylvania, Philadelphia, USA

Computer and Information Science

Advised by Prof. Jiatao Gu.

M.S., Zhejiang University, Hangzhou, China

Advised by Prof. Gaoang Wang.

B.S., Dalian University of Technology (DLUT), Dalian, China

Software Engineering

Research Experience

University of California, San Diego (UCSD), USA

Visiting Researcher

Advised by Prof. Zhuowen Tu.

Media Computing Group, Microsoft Research Lab - Asia, Beijing

Research Intern

Mentor: Xun Guo.

Professional Service

Conference & Journal Refereeing

IJCV 2026, ICML 2026, CVPR 2025 & 2026, ICLR 2025 & 2026, NeurIPS 2025, TPAMI 2025, PRCV 2023 & 2025, TMM 2024.

Invited Talks

  • Feb 2026
    From Compression to Selection: Better and Longer Video Understanding
  • Oct 2025
    Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
    Workshop on Knowledge MR at ICCV 2025, Oahu, HI  ·  Slides
  • Sep 2025
    From Seeing to Thinking
    Lambda AI (virtual)  ·  Slides

Teaching

Spring 2024
ECE 445 Senior Design (Undergraduate)

Teaching Assistant with Prof. Gaoang Wang.

Selected Honors & Awards

  • 2025
    Outstanding Paper Award, ICCV 2025 Knowledge-Intensive Multimodal Reasoning Workshop
  • 2025
    Lambda AI Cloud Credits Grant Sponsorship
  • 2025
    National Scholarship, Zhejiang University
  • 2024
    National Scholarship, Zhejiang University
  • 2021
    National Scholarship, Dalian University of Technology

Research Overview

My research centers on video understanding and generative models, with key areas of focus including:

  • Efficient Long-Sequence Modeling, especially for long video inputs, using techniques such as hybrid memory, token compression, RNNs, sparse attention, and linear attention mechanisms.
    Topics: Token Merging, Linear Attention, Sparse Attention. Representative works: MovieChat (CVPR 2024), MovieChat+ (TPAMI 2025), AuroraCap (ICLR 2025), AuroraLong (ICCV 2025), VideoNSA (ICLR 2026).
  • Applications of Generative Models, with an emphasis on masked image modeling for text-to-image synthesis and a strong focus on improving data and training efficiency.
  • Benchmarking and Evaluation, creating complex, meaningful real-world challenges in video domains to probe the boundaries of model capabilities while providing insights for future improvement.
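To give a flavor of the sparse-attention idea above: in block-sparse attention, each query attends only to keys in its own local block plus a handful of global tokens, rather than to the full sequence. The NumPy sketch below is a toy illustration of this general pattern, not the VideoNSA implementation; the `block_size` and `n_global` parameters are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) contributes zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block_size=4, n_global=2):
    """Toy block-sparse attention over (seq_len, dim) arrays.

    Each query attends to keys in its own local block plus the first
    `n_global` "global" tokens. Masked positions get -inf scores, so
    they receive zero attention weight after the softmax.
    """
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)          # full scores, kept for clarity
    mask = np.full((seq_len, seq_len), -np.inf)
    mask[:, :n_global] = 0.0                 # global tokens visible to all
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        mask[start:end, start:end] = 0.0     # local block visible
    return softmax(scores + mask) @ v
```

A real kernel would never materialize the dense score matrix; it would compute only the unmasked blocks, which is where the efficiency gain for long video token sequences comes from.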

Selected Publications and Manuscripts

* Equal contribution  ·  Google Scholar

VideoNSA: Native Sparse Attention Scales Video Understanding
ICLR 2026

VideoNSA delivers hardware-aware native sparse attention primitives for efficient video understanding systems.

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
TPAMI 2025

MovieChat+ extends the hybrid memory framework to question-aware sparse memory for long video question answering.

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
ICCV Findings 2025

Video-MMLU is a massive benchmark for evaluating LMMs on multi-discipline lecture understanding.

AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
ICCV 2025

AuroraLong uses a linear RNN language model with constant-size hidden states to handle arbitrary-length video inputs.

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
ICLR 2025

AuroraCap is a multimodal LLM for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
ICLR 2025

Meissonic revitalizes masked generative transformers for efficient, high-resolution text-to-image synthesis.

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-modality Models
ACM MM 2024

VLMEvalKit is an open-source evaluation toolkit supporting 100+ large multimodal models across 60+ benchmarks.

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
CVPR 2024

MovieChat achieves state-of-the-art performance on extra-long video (>10K frames) understanding by introducing a hybrid dense-token / sparse-memory mechanism.

Blog

Occasional writing on research, tools, and things I find interesting.
