Video-MMLU


A Massive Multi-Discipline Lecture Understanding Benchmark


Enxin Song

Zhejiang University

Wenhao Chai

University of Washington

Weili Xu

UIUC

Jianwen Xie

Lambda, Inc.

Yuxuan Liu

UIUC

Gaoang Wang

Zhejiang University


Leaderboard

🤩 Welcome to the Video-MMLU lecture hall! Is your model ready to be tested?

Class is in session! Submit your scores to see if your models make the honor roll. Professor is waiting...

Please remember to report your frame sampling number and visual tokens per frame with each submission, and email us your results.

#F stands for the frame sampling number of the input video, and #T stands for the visual tokens per frame.
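For reference, a leaderboard entry could be packaged as in the sketch below; the field names are purely illustrative and do not represent an official submission schema.

```python
import json

# Illustrative submission record -- the field names are hypothetical, not an
# official schema. Report at least #F (frames sampled) and #T (tokens per frame).
submission = {
    "model": "MyVideoLMM-7B",   # hypothetical model name
    "frames_sampled": 32,       # #F: number of frames sampled from the video
    "tokens_per_frame": 196,    # #T: visual tokens per frame
    "caption_score": None,      # filled in after running the evaluation
    "qa_score": None,
}

print(json.dumps(submission, indent=2))
```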

Models on the leaderboard are marked as open-source, proprietary, or base LLMs.

Findings

Finding 1
Large-scale LMMs do not show clear advantages over smaller ones.

Although LMM scaling laws suggest significant performance improvements with increased model size, this trend is less pronounced in Video-MMLU. Model size correlates more strongly with performance in video QA than in video captioning, implying that reasoning benefits more from scaling.

Finding 2
LLM architecture shapes LMMs' balance between perception and reasoning.

Most models perform better at captioning than at QA, highlighting the greater reasoning challenge posed by lecture QA in Video-MMLU. LMMs built on Qwen2.5 and InternLM2.5 achieve strong and balanced performance, and MoE-based LLMs also perform well. Earlier architectures such as Vicuna and LLaMA2 perform poorly in QA due to their weaker reasoning and instruction-following capabilities.

Finding 3
Can LMMs with visual token compression sustain strong performance in complex, context-rich lecture understanding tasks like Video-MMLU?

  1. Token compression boosts efficiency, but compressed models still lag significantly behind SOTA models.
  2. Architecture and compression strategy are as crucial as raw token count, driving large variance in performance.
  3. Substantial token reduction can preserve performance, but ultra-low token counts cause sharp drops.
  4. 16–300 tokens per frame offers the best efficiency-performance balance (see the budget sketch after this list).
  5. Larger models can partially offset the information loss from compression.
  6. AuroraCap's non-linear curve shows the need for domain-specific token optimization.
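As a rough illustration of the budgets these findings refer to, the snippet below multiplies the frame sampling number by the per-frame token count; the frame count and token settings are hypothetical examples, not the configuration of any particular model.

```python
# Back-of-the-envelope visual-token budgets for one lecture clip.
# The frame count and per-frame token counts are hypothetical examples.
FRAMES_SAMPLED = 32  # #F

for tokens_per_frame in (729, 300, 196, 16, 1):  # #T under increasing compression
    total_tokens = FRAMES_SAMPLED * tokens_per_frame
    print(f"#T = {tokens_per_frame:>3} -> {total_tokens:>6} visual tokens per video")
```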

Finding 4
Lecture understanding in models relies more on textual content in frames than on animations.

Models perform markedly better on physics and chemistry lectures, which contain more textual explanations, than on mathematics lectures, which emphasize formulas and dynamic visual proofs.

Finding 5
Larger LLMs enhance lecture understanding, but with diminishing returns.

As model size increases, Qwen2.5 exhibits continuous performance gains, whereas the gains of InternVL2.5 (which uses Qwen2.5 as its LLM) gradually diminish.


Lecture Cases

Benchmark Construction

Video-MMLU aims to rigorously evaluate the capabilities of Large Multimodal Models (LMMs) in perceiving and reasoning over real-world educational videos. Unlike existing benchmarks, Video-MMLU focuses on lecture videos filled with complex formulas and dynamic animations that demand multi-step reasoning.

| Dataset | Theme | # Video | Ave. Duration (s) | # Caption | # Word | # Vocab. | Ave. Length | # QA | QA Type |
|---|---|---|---|---|---|---|---|---|---|
| MovieChat-1K | Movie | 1,000 | 564 | 1,000 | 121,077 | 102,988 | 121 | 13,000 | OE |
| MMWorld | Professional | 1,910 | 107 | 1,910 | - | - | 66 | 6,627 | MC |
| MLVU | Open | 1,730 | 930 | 247 | - | - | - | 3,102 | MC |
| MVBench | Open | 4,000 | 16 | - | - | × | - | 4,000 | MC |
| LongVideoBench | Open | 3,763 | 473 | - | - | × | - | 6,678 | MC |
| TempCompass | Open | 410 | < 30 | - | - | × | - | 7,540 | MC |
| Video-MMMU | Professional | 300 | 506 | - | - | × | - | 900 | MC |
| VATEX | Open | 41,250 | 10 | 41,250 | 4,994,768 | 44,103 | 15 | - | × |
| VDC | Open | 1,027 | 28 | 1,027 | 515,441 | 20,419 | 501 | - | × |
| LongCaptioning | Open | 10,000 | 93 | - | - | - | 1,198 | - | × |
| Video-MMLU (ours) | Professional | 1,065 | 109 | 1,065 | 520,679 | 27,613 | 489 | 15,746 | OE |
Table 1: Benchmark comparison for video understanding tasks. Ave. Length indicates the average number of words per caption. OE stands for Open-ended, and MC stands for Multiple-choice.

Inspired by the classroom setting, Video-MMLU treats the model as a student tasked with learning from video lectures, while the dataset acts as the teacher. This paradigm involves two primary evaluation tasks: the model "takes notes" by generating detailed captions for the video content, and it "takes a quiz" by answering challenging questions that require visual reasoning over the lecture material.
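A minimal sketch of this two-task protocol is shown below, under the assumption that the model is wrapped as a simple `generate(frames, prompt)` callable; the prompts and names are illustrative, and the reference-based scoring step (e.g. by an LLM judge) is omitted.

```python
def evaluate_lecture(generate, frames, qa_pairs):
    """Sketch of the two Video-MMLU tasks for a single lecture video.

    `generate` is any callable (frames, prompt) -> str wrapping an LMM;
    `frames` are pre-sampled video frames; `qa_pairs` is a list of
    {"question": ..., "answer": ...} dicts.
    """
    # Task 1: "take notes" -- produce a detailed caption of the lecture.
    caption = generate(frames, "Describe this lecture video in detail.")

    # Task 2: "take a quiz" -- answer each open-ended question in turn.
    answers = [
        generate(frames, f"Question: {qa['question']}\nAnswer:")
        for qa in qa_pairs
    ]

    # Captions and answers are then scored against reference annotations
    # (e.g. by an LLM judge); that scoring step is omitted from this sketch.
    return caption, answers
```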

The construction of Video-MMLU involves collecting 1,065 videos sourced from 10 distinct educational YouTube channels. The collection heavily emphasizes Mathematics, complemented by Physics and Chemistry. Video durations were constrained to between 10 seconds and 4 minutes (240 seconds), and only videos with available English subtitles were included to provide a textual baseline.
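The duration and subtitle constraints amount to a simple metadata filter, sketched below; the metadata field names are illustrative rather than those of the actual collection pipeline.

```python
# Illustrative filter mirroring the collection constraints described above;
# the metadata field names are hypothetical.
def keep_video(meta: dict) -> bool:
    return (
        10 <= meta["duration_s"] <= 240          # 10 seconds to 4 minutes
        and meta.get("has_english_subtitles", False)
    )

candidates = [
    {"id": "vid_001", "duration_s": 95,  "has_english_subtitles": True},
    {"id": "vid_002", "duration_s": 600, "has_english_subtitles": True},
    {"id": "vid_003", "duration_s": 120, "has_english_subtitles": False},
]
selected = [m for m in candidates if keep_video(m)]
print([m["id"] for m in selected])  # -> ['vid_001']
```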

FAQs

Video-MMLU is a benchmark designed to evaluate large multimodal models on their ability to understand and reason about real-world lecture videos across multiple domains and disciplines.

Unlike text-only or image-only benchmarks, Video-MMLU specifically tests comprehension of educational video content, requiring models to integrate visual, auditory, and temporal information to answer challenging questions.

The benchmark dataset and evaluation code are available through our GitHub repository. Researchers can use it to test their models and compare results with existing baselines.

BibTeX

@misc{song2024videommlu,
  title={Video-MMLU: A Massive Multi-discipline Lecture Understanding Benchmark},
  author={Enxin Song and Wenhao Chai and Weili Xu and Jianwen Xie and Yuxuan Liu and Gaoang Wang},
  year={2024},
  eprint={2407.04171},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}