Enxin Song (宋恩欣)
Master Student @ ZJU

About

Enxin Song is a research intern at the University of California, San Diego (UCSD) under Prof. Zhuowen Tu. She will receive her M.S. in March 2026 from Zhejiang University, advised by Prof. Gaoang Wang (CVNext Lab), and holds a B.S. in Software Engineering from Dalian University of Technology. Previously, she was a research intern at Microsoft Research Asia. Enxin stays curious about the wider landscape of computer vision and deep learning and actively seeks new collaboration opportunities. Her work centers on video understanding, highlighted by MovieChat, the first large multi-modal model for hour-long video understanding. She has co-organized workshops and challenges on video understanding at CVPR 2024 and 2025. She is a highly self-motivated, curious student applying to Ph.D. programs for Fall 2026.

Experiences

  • Feb. 2025 -- Present, University of California, San Diego (UCSD), USA
         Visiting Intern, advised by Prof. Zhuowen Tu
  • Nov. 2023 -- May 2024, Media Computing Group, Microsoft Research Lab - Asia, Beijing, China
         Research Intern, working on text-to-image generation

Education


    M.S.           Sep. 2023 - Mar. 2026 (expected)
                       Zhejiang University, Hangzhou, China.
                       1/82, M.S. in Artificial Intelligence
    B.S.           Sep. 2019 - Jun. 2023
                      Dalian University of Technology (DLUT), Dalian, China.
                      21/385, B.S. in Software Engineering

    News

    • We are hosting two CVPR 2025 Video Understanding Challenges: LOVE Track 1A and LOVE Track 1B.
    • We release Video-MMLU, a Massive Multi-Discipline Lecture Understanding Benchmark.
    • One paper accepted to the CVPR 2025 Workshop on Efficient Large Vision Models.
    • Our paper AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark is accepted by ICLR 2025.
    • Our paper Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis is accepted by ICLR 2025.

    Selected Publications and Manuscripts

    * Equal contribution. † Project lead. ‡ Corresponding author.

    Also see Google Scholar.

    Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark
    Video-MMLU is a massive benchmark designed to evaluate the capabilities of LMMs in understanding Multi-Discipline Lectures.
    AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
    ICLR, 2025
    AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.
    MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
    CVPR, 2024
    MovieChat achieves state-of-the-art performance in extra-long video (more than 10K frames) understanding by introducing a memory mechanism.

    Professional Service

  • Conference and Journal Refereeing:
         CVPR 2025, ICLR 2025, TMM 2024, PRCV 2023
  • Workshop Organization:
         Workshop on Long-form Video Understanding at CVPR 2025
         Workshop on Long-form Video Understanding at CVPR 2024
  • Teaching Assistant:
         ECE 445 Senior Design (Undergraduate), Spring 2024, with Prof. Gaoang Wang
Selected Honors & Awards

    • National Scholarship, 2024 (Zhejiang University)
    • National Scholarship, 2021 (Dalian University of Technology)