Me
Enxin Song (宋恩欣)
Master Student @ ZJU

About

Hi, I am currently a graduate student at the CVNext Lab, Zhejiang University, advised by Prof. Gaoang Wang. I obtained my B.S. degree in Software Engineering from Dalian University of Technology. Previously, I was a research intern at Microsoft Research Asia. My research focuses primarily on large multimodal models (LMMs) for video understanding and generative models.

Education


M.S.           Sep. 2023 - Mar. 2026 (expected)
                   Zhejiang University, Hangzhou, China.
                   Master's in Artificial Intelligence
B.S.           Sep. 2019 - Jun. 2023
                  Dalian University of Technology (DLUT), Dalian, China.
                  B.S. in Software Engineering

Experiences

  • Feb. 2025 -- Present, University of California, San Diego (UCSD), USA
         Visiting Intern, advised by Prof. Zhuowen Tu.
  • Nov. 2023 -- May 2024, Media Computing Group, Microsoft Research Lab - Asia, Beijing, China
         Research Intern, working on text-to-image generation.

News

    • We are hosting the CVPR 2025 Video Understanding Challenge @ LOVEU.
    • Our paper AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark is accepted by ICLR 2025.
    • Our paper Meissonic: Revitalizing masked generative transformers for efficient high-resolution text-to-image synthesis is accepted by ICLR 2025.

    Selected Publications and Manuscripts

    * Equal contribution. † Project lead. ‡ Corresponding author.

    Also see Google Scholar.

    AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
    ICLR, 2025
    AuroraCap is a multimodal LLM designed for image and video detailed captioning. We also release VDC, the first benchmark for detailed video captioning.
    Fantasy: Transformer Meets Transformer in Text-to-Image Generation
    Preprint, 2024
    Fantasy is an efficient text-to-image generation model marrying decoder-only large language models (LLMs) with transformer-based masked image modeling (MIM).
    MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
    CVPR, 2024
    MovieChat achieves state-of-the-art performance on extra-long video understanding (more than 10K frames) by introducing a memory mechanism.

    Professional Service

  • Conference and Journal Refereeing:
         CVPR 2025, ICLR 2025, TMM 2024, PRCV 2023
  • Workshop Organization:
         Workshop on Long-form Video Understanding at CVPR 2025
         Workshop on Long-form Video Understanding at CVPR 2024
  • Teaching Assistant:
         ZJUI Senior Design, Spring 2024, with Prof. Gaoang Wang
Selected Honors & Awards

    • National Scholarship, 2024 (Zhejiang University)
    • National Scholarship, 2021 (Dalian University of Technology)