Enxin Song is a Ph.D. student in Computer and Information Science at the University of Pennsylvania, advised by Prof. Jiatao Gu. She received her master's degree from Zhejiang University and her bachelor's degree from Dalian University of Technology. She has conducted research at New York University with Prof. Saining Xie and at the University of California, San Diego with Prof. Zhuowen Tu. Her research centers on video understanding and generative models, with a focus on efficient long-sequence modeling, generative models for text-to-image synthesis, and benchmarking for video understanding. She has co-organized workshops at CVPR 2024 and 2025.

News

  • Jan 2026 Our VideoNSA is accepted by ICLR 2026.
  • Nov 2025 Selected (top 10%) to give a talk at the KAUST Rising Stars in AI Symposium 2026.
  • Oct 2025 Video-MMLU received the Outstanding Paper Award at the ICCV 2025 Knowledge-Intensive Multimodal Reasoning Workshop, along with a travel grant.
  • Oct 2025 We release VideoNSA, a hardware-aware native sparse attention mechanism for video understanding.
  • Sep 2025 Invited talk at Lambda AI titled From Seeing to Thinking.
  • Sep 2025 One paper accepted by ICCV 2025 KnowledgeMR Workshop.
  • Aug 2025 Our paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering is accepted by IEEE TPAMI.

Education

Ph.D., University of Pennsylvania, Philadelphia, USA

Computer and Information Science

Advised by Prof. Jiatao Gu.

M.S., Zhejiang University, Hangzhou, China

Advised by Prof. Gaoang Wang.

B.S., Dalian University of Technology (DLUT), Dalian, China

Software Engineering

Research Experience

University of California, San Diego (UCSD), USA

Visiting Researcher

Advised by Prof. Zhuowen Tu.

Media Computing Group, Microsoft Research Lab - Asia, Beijing

Research Intern

Mentor: Xun Guo.

Professional Service

Conference & Journal Refereeing

IJCV 2026, ICML 2026, CVPR 2025 & 2026, ICLR 2025 & 2026, NeurIPS 2025, TPAMI 2025, PRCV 2023 & 2025, TMM 2024.

Invited Talks

  • Feb 2026
    From Compression to Selection: Better and Longer Video Understanding
  • Oct 2025
    Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
    Workshop on Knowledge MR at ICCV 2025, Oahu, HI  ·  Slides
  • Sep 2025
    From Seeing to Thinking
    Lambda AI (virtual)  ·  Slides

Teaching

Spring 2024
ECE 445 Senior Design (Undergraduate)

Teaching Assistant with Prof. Gaoang Wang.

Selected Honors & Awards

  • 2025
    Outstanding Paper Award, ICCV 2025 Knowledge-Intensive Multimodal Reasoning Workshop
  • 2025
    Lambda AI Cloud Credits Grant Sponsorship
  • 2025
    National Scholarship, Zhejiang University
  • 2024
    National Scholarship, Zhejiang University
  • 2021
    National Scholarship, Dalian University of Technology

Research Overview

My research centers on video understanding and generative models, with key areas of focus including:

  • Efficient Long-Sequence Modeling, especially for long video inputs, using techniques such as hybrid memory, token compression, RNNs, sparse attention, and linear attention mechanisms.
    Topics: Token Merging, Linear Attention, Sparse Attention. Representative works: MovieChat (CVPR 2024), MovieChat+ (TPAMI 2025), AuroraCap (ICLR 2025), AuroraLong (ICCV 2025), VideoNSA (ICLR 2026).
  • Applications of Generative Models, with an emphasis on masked image modeling for text-to-image synthesis and a strong focus on improving data and training efficiency.
  • Benchmarking and Evaluation, creating complex, meaningful real-world challenges in video domains to probe the boundaries of model capabilities while providing insights for future improvement.
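To give a flavor of the sparse-attention idea above: in block-sparse attention, each query attends only to keys in its own local block plus a handful of global tokens, rather than to the full sequence. The NumPy sketch below is a toy illustration of this general pattern, not the VideoNSA implementation; the `block_size` and `n_global` parameters are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) contributes zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block_size=4, n_global=2):
    """Toy block-sparse attention over (seq_len, dim) arrays.

    Each query attends to keys in its own local block plus the first
    `n_global` "global" tokens. Masked positions get -inf scores, so
    they receive zero attention weight after the softmax.
    """
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)          # full scores, kept for clarity
    mask = np.full((seq_len, seq_len), -np.inf)
    mask[:, :n_global] = 0.0                 # global tokens visible to all
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        mask[start:end, start:end] = 0.0     # local block visible
    return softmax(scores + mask) @ v
```

A real kernel would never materialize the dense score matrix; it would compute only the unmasked blocks, which is where the efficiency gain for long video token sequences comes from.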

Selected Publications and Manuscripts

* Equal contribution  ·  Google Scholar

VideoNSA: Native Sparse Attention Scales Video Understanding
ICLR 2026

VideoNSA delivers hardware-aware native sparse attention primitives for efficient video understanding systems.

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
TPAMI 2025

MovieChat+ extends the hybrid memory framework to question-aware sparse memory for long video question answering.

Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
ICCV Findings 2025

Video-MMLU is a massive benchmark for evaluating LMMs on multi-discipline lecture understanding.

AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding
ICCV 2025

AuroraLong uses a linear RNN language model with constant-size hidden states to handle arbitrary-length video inputs.

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
ICLR 2025

AuroraCap is a multimodal LLM for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
ICLR 2025

Meissonic revitalizes masked generative transformers for efficient, high-resolution text-to-image synthesis.

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-modality Models
ACM MM 2024

VLMEvalKit is an open-source evaluation toolkit supporting 100+ large multimodal models across 60+ benchmarks.

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
CVPR 2024

MovieChat achieves state-of-the-art performance on extra-long video (>10K frames) understanding by introducing a hybrid dense-token / sparse-memory mechanism.

Blog

Occasional writing on research, tools, and things I find interesting.
