Video-MMLU


A Massive Multi-Discipline Lecture Understanding Benchmark


Enxin Song

Zhejiang University

Wenhao Chai

University of Washington

Weili Xu

UIUC

Jianwen Xie

Lambda, Inc.

Yuxuan Liu

UIUC

Gaoang Wang

Zhejiang University


Leaderboard

🤩 Welcome to the Video-MMLU lecture hall! Is your model ready to be tested?

Class is in session! Submit your scores to see if your models make the honor roll. Professor is waiting...

Please remember to report your frame sampling number and visual tokens per frame with each submission, and email us your results.

#F stands for the frame sampling number of the input video, and #T stands for the visual tokens per frame.
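For reference, a leaderboard entry could be packaged as in the sketch below; the field names are purely illustrative and do not represent an official submission schema.

```python
import json

# Illustrative submission record -- the field names are hypothetical, not an
# official schema. Report at least #F (frames sampled) and #T (tokens per frame).
submission = {
    "model": "MyVideoLMM-7B",   # hypothetical model name
    "frames_sampled": 32,       # #F: number of frames sampled from the video
    "tokens_per_frame": 196,    # #T: visual tokens per frame
    "caption_score": None,      # filled in after running the evaluation
    "qa_score": None,
}

print(json.dumps(submission, indent=2))
```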

Models on the leaderboard are marked as open-source, proprietary, or base LLMs.

Findings

Finding 1
Large-scale LMMs do not show clear advantages over smaller ones.

Although LMM scaling laws suggest significant performance improvements with increased model size, this trend is less pronounced in Video-MMLU. Model size correlates more strongly with performance in video QA than in video captioning, implying that reasoning benefits more from scaling.

Finding 2
LLM architecture shapes LMMs' balance between perception and reasoning.

Most models perform better at captioning than at QA, highlighting the greater reasoning challenge posed by lecture QA in Video-MMLU. LMMs built on Qwen2.5 and InternLM2.5 achieve strong and balanced performance, and MoE-based LLMs also perform well. Earlier architectures such as Vicuna and LLaMA2 perform poorly in QA due to their weaker reasoning and instruction-following capabilities.

Finding 3
Can LMMs with visual token compression sustain strong performance in complex, context-rich lecture understanding tasks like Video-MMLU?

  1. Token compression boosts efficiency, but compressed models still lag significantly behind SOTA models.
  2. Architecture and compression strategy are as crucial as raw token count, driving large variance in performance.
  3. Substantial token reduction can preserve performance, but ultra-low token counts cause sharp drops.
  4. 16–300 tokens per frame offers the best efficiency-performance balance (see the budget sketch after this list).
  5. Larger models can partially offset the information loss from compression.
  6. AuroraCap's non-linear curve shows the need for domain-specific token optimization.
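As a rough illustration of the budgets these findings refer to, the snippet below multiplies the frame sampling number by the per-frame token count; the frame count and token settings are hypothetical examples, not the configuration of any particular model.

```python
# Back-of-the-envelope visual-token budgets for one lecture clip.
# The frame count and per-frame token counts are hypothetical examples.
FRAMES_SAMPLED = 32  # #F

for tokens_per_frame in (729, 300, 196, 16, 1):  # #T under increasing compression
    total_tokens = FRAMES_SAMPLED * tokens_per_frame
    print(f"#T = {tokens_per_frame:>3} -> {total_tokens:>6} visual tokens per video")
```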

Finding 4
Lecture understanding in models relies more on textual content in frames than on animations.

Models perform markedly better on physics and chemistry lectures, which contain more textual explanations, than on mathematics lectures, which emphasize formulas and dynamic visual proofs.

Finding 5
Larger LLMs enhance lecture understanding, but with diminishing returns.

As model size increases, Qwen2.5 exhibits continuous performance gains, whereas the gains of InternVL2.5 (which uses Qwen2.5 as its LLM) gradually diminish.


Lecture Cases

Benchmark Construction

Video-MMLU aims to rigorously evaluate the capabilities of Large Multimodal Models (LMMs) in perceiving and reasoning over real-world educational videos. Unlike existing benchmarks, Video-MMLU focuses on lecture videos filled with complex formulas and dynamic animations that demand multi-step reasoning.

| Dataset | Theme | # Video | Ave. Duration (s) | # Caption | # Word | # Vocab. | Ave. Length | # QA | QA Type |
|---|---|---|---|---|---|---|---|---|---|
| MovieChat-1K | Movie | 1,000 | 564 | 1,000 | 121,077 | 102,988 | 121 | 13,000 | OE |
| MMWorld | Professional | 1,910 | 107 | 1,910 | - | - | 66 | 6,627 | MC |
| MLVU | Open | 1,730 | 930 | 247 | - | - | - | 3,102 | MC |
| MVBench | Open | 4,000 | 16 | - | - | × | - | 4,000 | MC |
| LongVideoBench | Open | 3,763 | 473 | - | - | × | - | 6,678 | MC |
| TempCompass | Open | 410 | < 30 | - | - | × | - | 7,540 | MC |
| Video-MMMU | Professional | 300 | 506 | - | - | × | - | 900 | MC |
| VATEX | Open | 41,250 | 10 | 41,250 | 4,994,768 | 44,103 | 15 | - | × |
| VDC | Open | 1,027 | 28 | 1,027 | 515,441 | 20,419 | 501 | - | × |
| LongCaptioning | Open | 10,000 | 93 | - | - | - | 1,198 | - | × |
| Video-MMLU (ours) | Professional | 1,065 | 109 | 1,065 | 520,679 | 27,613 | 489 | 15,746 | OE |
Table 1: Benchmark comparison for video understanding tasks. Ave. Length indicates the average number of words per caption. OE stands for Open-ended, and MC stands for Multiple-choice.

Inspired by the classroom setting, Video-MMLU treats the model as a student tasked with learning from video lectures, while the dataset acts as the teacher. This paradigm involves two primary evaluation tasks: the model "takes notes" by generating detailed captions for the video content, and it "takes a quiz" by answering challenging questions that require visual reasoning over the lecture material.
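A minimal sketch of this two-task protocol is shown below, under the assumption that the model is wrapped as a simple `generate(frames, prompt)` callable; the prompts and names are illustrative, and the reference-based scoring step (e.g. by an LLM judge) is omitted.

```python
def evaluate_lecture(generate, frames, qa_pairs):
    """Sketch of the two Video-MMLU tasks for a single lecture video.

    `generate` is any callable (frames, prompt) -> str wrapping an LMM;
    `frames` are pre-sampled video frames; `qa_pairs` is a list of
    {"question": ..., "answer": ...} dicts.
    """
    # Task 1: "take notes" -- produce a detailed caption of the lecture.
    caption = generate(frames, "Describe this lecture video in detail.")

    # Task 2: "take a quiz" -- answer each open-ended question in turn.
    answers = [
        generate(frames, f"Question: {qa['question']}\nAnswer:")
        for qa in qa_pairs
    ]

    # Captions and answers are then scored against reference annotations
    # (e.g. by an LLM judge); that scoring step is omitted from this sketch.
    return caption, answers
```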

The construction of Video-MMLU involves collecting 1,065 videos sourced from 10 distinct educational YouTube channels. The collection heavily emphasizes Mathematics, complemented by Physics and Chemistry. Video durations were constrained to between 10 seconds and 4 minutes (240 seconds), and only videos with available English subtitles were included to provide a textual baseline.
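The duration and subtitle constraints amount to a simple metadata filter, sketched below; the metadata field names are illustrative rather than those of the actual collection pipeline.

```python
# Illustrative filter mirroring the collection constraints described above;
# the metadata field names are hypothetical.
def keep_video(meta: dict) -> bool:
    return (
        10 <= meta["duration_s"] <= 240          # 10 seconds to 4 minutes
        and meta.get("has_english_subtitles", False)
    )

candidates = [
    {"id": "vid_001", "duration_s": 95,  "has_english_subtitles": True},
    {"id": "vid_002", "duration_s": 600, "has_english_subtitles": True},
    {"id": "vid_003", "duration_s": 120, "has_english_subtitles": False},
]
selected = [m for m in candidates if keep_video(m)]
print([m["id"] for m in selected])  # -> ['vid_001']
```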

FAQs

Video-MMLU is a benchmark designed to evaluate large multimodal models on their ability to understand and reason about real-world lecture videos across multiple domains and disciplines.

Unlike text-only or image-only benchmarks, Video-MMLU specifically tests comprehension of educational video content, requiring models to integrate visual, auditory, and temporal information to answer challenging questions.

The benchmark dataset and evaluation code are available through our GitHub repository. Researchers can use it to test their models and compare results with existing baselines.

BibTeX

@misc{song2024videommlu,
  title={Video-MMLU: A Massive Multi-discipline Lecture Understanding Benchmark},
  author={Enxin Song and Wenhao Chai and Weili Xu and Jianwen Xie and Yuxuan Liu and Gaoang Wang},
  year={2024},
  eprint={2407.04171},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}