Enxin Song is a Ph.D. student in Computer and Information Science at the University of Pennsylvania, advised by Prof. Jiatao Gu. She received her master's degree from Zhejiang University and her bachelor's degree from Dalian University of Technology. She has conducted research at New York University with Prof. Saining Xie and at the University of California, San Diego with Prof. Zhuowen Tu. Her research centers on video understanding and generative models, with a focus on efficient long-sequence modeling, generative models for text-to-image synthesis, and benchmarking for video understanding. She has co-organized workshops at CVPR 2024 and 2025.

News

  • Jan 2026 Our VideoNSA is accepted to ICLR 2026.
  • Nov 2025 Selected (top 10%) to give a talk at the KAUST Rising Stars in AI Symposium 2026.
  • Oct 2025 Video-MMLU received the Outstanding Paper Award and a travel grant at the ICCV 2025 Knowledge-Intensive Multimodal Reasoning Workshop.
  • Oct 2025 We release VideoNSA, a hardware-aware native sparse attention mechanism for video understanding.
  • Sep 2025 Invited talk at Lambda AI, "From Seeing to Thinking."
  • Sep 2025 One paper accepted to the ICCV 2025 KnowledgeMR Workshop.
  • Aug 2025 Our paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering is accepted to IEEE TPAMI.

Education

Ph.D., University of Pennsylvania, Philadelphia, USA

Computer and Information Science

Advised by Prof. Jiatao Gu.

M.S., Zhejiang University, Hangzhou, China

Advised by Prof. Gaoang Wang.

B.S., Dalian University of Technology (DLUT), Dalian, China

Software Engineering


Research Experience

University of California, San Diego (UCSD), USA

Visiting Researcher

Advised by Prof. Zhuowen Tu.

Media Computing Group, Microsoft Research Lab - Asia, Beijing

Research Intern

Mentor: Xun Guo.


Professional Service

Conference & Journal Refereeing

IJCV 2026, ICML 2026, CVPR 2025 & 2026, ICLR 2025 & 2026, NeurIPS 2025, TPAMI 2025, PRCV 2023 & 2025, TMM 2024.

Invited Talks

  • Feb 2026
    From Compression to Selection: Better and Longer Video Understanding
  • Oct 2025
    Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
    Workshop on Knowledge MR at ICCV 2025, Oahu, HI
  • Sep 2025
    From Seeing to Thinking
    Lambda AI (virtual)

Teaching

Spring 2024
ECE 445 Senior Design (Undergraduate)

Teaching Assistant with Prof. Gaoang Wang.

Selected Honors & Awards

  • 2026
    Selected Speaker (top 10%), KAUST Rising Stars in AI Symposium
  • 2025
    Outstanding Paper Award, ICCV 2025 Knowledge-Intensive Multimodal Reasoning Workshop
  • 2025
    Lambda AI Cloud Credits Grant Sponsorship
  • 2025
    National Scholarship, Zhejiang University
  • 2024
    National Scholarship, Zhejiang University
  • 2021
    National Scholarship, Dalian University of Technology

Research Overview

My research centers on video understanding and generative models, with key areas of focus including:

  • Efficient Long-Sequence Modeling, especially for long video inputs, using techniques such as hybrid memory, token compression, RNNs, sparse attention, and linear attention mechanisms; a minimal sketch of the linear-attention idea appears after this list.
    (Topic tags: Token Merging, Linear Attention, Sparse Attention. Project timeline: MovieChat, 2023.07, CVPR 2024 → MovieChat+, 2024.04, TPAMI → AuroraCap, 2024.09, ICLR 2025 → AuroraLong, 2025.01, ICCV 2025 → VideoNSA, 2025.09, ICLR 2026.)
  • Applications of Generative Models, with an emphasis on masked image modeling for text-to-image synthesis and on improving data and training efficiency.
  • Benchmarking and Evaluation, creating complex, meaningful real-world challenges in video domains to probe the boundaries of model capabilities and provide insights for future improvement.
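
To make the long-sequence bullet concrete (e.g., the constant-size hidden state that the AuroraLong entry below refers to), here is a minimal, hypothetical sketch of causal linear attention written as an RNN. It is the generic formulation with an elu-based feature map and a running state, not the implementation from any of the papers above, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def linear_attention_recurrent(q, k, v):
    """Causal linear attention computed as an RNN (illustrative sketch).

    Instead of a (T x T) softmax attention matrix, keep a running state
    S of fixed size (d_k x d_v) and a normalizer z of size (d_k,), so
    per-step memory is constant no matter how long the sequence grows.
    Shapes: q, k: (T, d_k); v: (T, d_v).
    """
    phi_q = F.elu(q) + 1.0  # positive feature map
    phi_k = F.elu(k) + 1.0
    T, d_k = q.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)   # running sum of outer(phi(k_t), v_t)
    z = torch.zeros(d_k)        # running sum of phi(k_t)
    out = torch.empty(T, d_v)
    for t in range(T):          # state size is independent of T
        S = S + torch.outer(phi_k[t], v[t])
        z = z + phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z + 1e-6)
    return out

# A 10,000-step sequence uses the same per-step memory as a 10-step one.
y = linear_attention_recurrent(torch.randn(10_000, 64),
                               torch.randn(10_000, 64),
                               torch.randn(10_000, 64))
print(y.shape)  # torch.Size([10000, 64])
```

The trade-off: memory stays constant in sequence length, at the cost of approximating softmax attention with a kernel feature map.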

Selected Publications and Manuscripts

* Equal contribution.

Also see Google Scholar.

VideoNSA: Native Sparse Attention Scales Video Understanding
ICLR, 2026
VideoNSA delivers hardware-aware native sparse attention primitives for efficient video understanding systems; a toy sketch of block-sparse attention follows below.
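
To make "native sparse attention" concrete, below is a minimal, hypothetical sketch of one common ingredient, top-k block selection. It is illustrative only: non-causal, and it builds a dense score matrix and then masks it, whereas a hardware-aware kernel would skip unselected blocks entirely. It is not the VideoNSA implementation.

```python
import torch
import torch.nn.functional as F

def topk_block_sparse_attention(q, k, v, block=64, topk=4):
    """Toy top-k block-sparse attention (mask-based, non-causal).

    Each query attends only to the `topk` key/value blocks whose mean
    key is most similar to it, instead of all T keys.
    q, k, v: (T, d), with T divisible by `block`.
    """
    T, d = q.shape
    n_blocks = T // block
    k_summary = k.view(n_blocks, block, d).mean(dim=1)  # (n_blocks, d)
    block_scores = q @ k_summary.T                      # (T, n_blocks)
    top = block_scores.topk(topk, dim=-1).indices       # (T, topk)
    keep = torch.zeros(T, n_blocks, dtype=torch.bool)
    keep[torch.arange(T).unsqueeze(1), top] = True      # mark selected blocks
    mask = keep.repeat_interleave(block, dim=1)         # expand to (T, T)
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))   # drop unselected keys
    return F.softmax(scores, dim=-1) @ v

out = topk_block_sparse_attention(torch.randn(512, 64),
                                  torch.randn(512, 64),
                                  torch.randn(512, 64))
print(out.shape)  # torch.Size([512, 64])
```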
AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
ICCV, 2025
AuroraLong uses a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states to solve long video understanding tasks.
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
ICLR, 2025
AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
CVPR, 2024
MovieChat achieves state-of-the-art performance on extra-long video understanding (more than 10K frames) by introducing a memory mechanism; a toy sketch of the consolidation idea follows.
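
As a toy illustration of memory consolidation for long videos, the hypothetical sketch below greedily merges the most cosine-similar pair of adjacent frame tokens until a fixed memory budget is met. The function name and the simple averaging rule are illustrative, not MovieChat's exact procedure.

```python
import torch
import torch.nn.functional as F

def consolidate_memory(frame_tokens, capacity):
    """Greedy adjacent-pair merging into a fixed-size memory (toy sketch).

    frame_tokens: (N, d) tensor, one embedding per frame, in temporal
    order. Repeatedly average the most similar adjacent pair until only
    `capacity` tokens remain, so memory stays constant-size regardless
    of how many frames arrive.
    """
    mem = list(frame_tokens)
    while len(mem) > capacity:
        stack = torch.stack(mem)                              # (n, d)
        sims = F.cosine_similarity(stack[:-1], stack[1:], dim=-1)
        i = int(torch.argmax(sims))         # most redundant adjacent pair
        mem[i] = (mem[i] + mem[i + 1]) / 2  # merge by averaging
        del mem[i + 1]
    return torch.stack(mem)

# 1,000 frames compressed into a 64-token long-term memory.
memory = consolidate_memory(torch.randn(1_000, 768), capacity=64)
print(memory.shape)  # torch.Size([64, 768])
```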