TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

1: INSAIT, Sofia University "St. Kliment Ohridski", Bulgaria  •  2: Visual Geometry Group, University of Oxford

TL;DR

ImplausiBench Leaderboard 🏆

Accuracies (%) on Implausible (generated) and Real subsets (150 videos each). We report both Human and LLM‑judge scores.

Model                     Implausible (Human)  Implausible (LLM)  Real (Human)  Real (LLM)
LLaVA‑NeXT (TRAVL)        52.7                 28.7               47.3          31.3
Gemini 2.5 Pro            41.3                 29.3               100.0         78.0
LLaVA‑NeXT (SFT)          34.0                 22.0               45.3          23.3
GPT‑4o                    32.7                 21.3               84.7          64.0
Qwen2.5‑VL                18.7                 12.0               96.7          74.7
InternVideo 2.5           12.7                 4.7                92.7          76.0
LLaVA‑NeXT (pretrained)   3.3                  2.7                98.7          62.7

Human-judge scores are the gold standard; LLM-judge scores are reported alongside them for every model.
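
Because each subset contains 150 videos, every accuracy in the leaderboard corresponds to a whole number of correctly judged videos. The sketch below shows that arithmetic; the function names are illustrative and not taken from any released code.

```python
SUBSET_SIZE = 150  # videos per subset, as stated for the leaderboard


def accuracy_pct(num_correct: int, total: int = SUBSET_SIZE) -> float:
    """Accuracy in percent, rounded to one decimal place like the leaderboard."""
    return round(100.0 * num_correct / total, 1)


def correct_count(pct: float, total: int = SUBSET_SIZE) -> int:
    """Number of correctly judged videos implied by a reported percentage."""
    return round(pct / 100.0 * total)


# LLaVA-NeXT (TRAVL), Implausible subset, human judge: 52.7% of 150 videos.
print(correct_count(52.7))  # 79
print(accuracy_pct(79))     # 52.7
```

Read this way, the human-judged gap between LLaVA‑NeXT (TRAVL) and LLaVA‑NeXT (pretrained) on the Implausible subset is 79 versus 5 videos out of 150.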

ImplausiBench Example 📹

Each scenario has a plausible (real) clip and a matched implausible (generated) counterpart that shares its first frame and visual style.

[Videos: a Real clip and its matched Implausible counterpart]

Multiple‑Choice QA ⁉️

Questions are adversarially constructed to remove language‑only shortcuts. The same question and options are posed for both clips of a scenario, and the correct answer depends on what the video actually shows.

Q: Do the events in the video appear to follow physics principles (Real) or not (Implausible)? Why?
A. Real, because the marble hits and topples the first domino, triggering a chain reaction that causes all of them to fall.
B. Implausible, because the marble passes through the dominoes without affecting them.
C. Real, because the marble rolls past the dominoes without hitting them, so they remain upright as expected.
D. Implausible, because the dominoes hover above the marble, which defies gravity and real-world physics.
E. Real, because the stacked-up dominoes fall realistically when struck by the rolling marble.
F. Implausible, because the marble changes direction abruptly and the dominoes move without being touched, which defies physical causality.
G. None of the given reasons is entirely correct.
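
One way to read the option layout: several options share the same verdict, so guessing "Real" or "Implausible" from the wording alone is not enough, and the chosen reason must also match the video. The sketch below groups the options of the example question by their leading verdict. The option texts are shortened, and the helper names are hypothetical.

```python
from typing import Dict, List, Optional

# Option texts shortened from the example question above.
OPTIONS = {
    "A": "Real, because the marble hits and topples the first domino",
    "B": "Implausible, because the marble passes through the dominoes",
    "C": "Real, because the marble rolls past the dominoes",
    "D": "Implausible, because the dominoes hover above the marble",
    "E": "Real, because the stacked-up dominoes fall realistically",
    "F": "Implausible, because the marble changes direction abruptly",
    "G": "None of the given reasons is entirely correct.",
}


def verdict(option_text: str) -> Optional[str]:
    """Return 'Real' or 'Implausible' if the option opens with that verdict."""
    for label in ("Real", "Implausible"):
        if option_text.startswith(label + ","):
            return label
    return None  # e.g. the "None of the given reasons" option


def group_by_verdict(options: Dict[str, str]) -> Dict[Optional[str], List[str]]:
    """Group option letters by the verdict their text opens with."""
    groups: Dict[Optional[str], List[str]] = {}
    for letter, text in options.items():
        groups.setdefault(verdict(text), []).append(letter)
    return groups


print(group_by_verdict(OPTIONS))
# {'Real': ['A', 'C', 'E'], 'Implausible': ['B', 'D', 'F'], None: ['G']}
```

Each verdict is backed by three different explanations, so a model that gets the verdict right still has to pick the explanation consistent with the clip.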

TRAVL Training Dataset 🗃️

This dataset contains 3,482 videos and 19,708 QA pairs spanning real and implausible clips from multiple sources, including Physics‑IQ, Impossible Videos, and Video‑ChatGPT. Below are three short examples, each with its associated question and answer.

Question: What occurs when the yellow rubber ducky is released above the fabric box?
Answer: Upon release, the duck defies expectations by not falling into the box as gravity would dictate. Instead, it suddenly increases in size and levitates above the box, which is physically implausible because it contradicts the natural effects of gravity without any apparent external forces acting on it.

Question: What sequence of events is depicted in the video involving the candles?
Answer: The video begins by showing storm clouds and lightning from a window, alongside a small lit candle placed next to the window. As the video progresses, the candle starts to grow taller. Then, a second candle appears suddenly, duplicating from the first. Finally, a curtain inexplicably enters the scene, revealing a third, much larger candle.

Question: What happens to the cake as the video progresses?
Answer: As the video progresses, the left side of the chocolate cake starts to deform and crumple as if someone is taking a bite out of it. However, this happens without any visible force or person interacting with the cake, which is physically implausible. In reality, for the cake to deform in such a way, an external force, such as a person's bite or a utensil's action, would be necessary.

Method: TRAVL

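TRAVL adds two attention components to the video-language model: spatial self‑attention within each frame and trajectory‑guided temporal attention across frames, with the trajectories obtained from CoTracker. The vision and LLM backbones stay frozen; only the added attention and the projector are trained.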
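
To make the idea of trajectory‑guided temporal attention concrete, the sketch below builds a sparse attention pattern from point tracks: a patch token may attend to patch tokens in other frames only if the same track passes through both. This is an illustrative reading, not the authors' implementation. The patch size, the point‑to‑patch mapping, and all names are assumptions, and occluded or out‑of‑frame points are ignored.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]  # (x, y) in pixels
Token = Tuple[int, int]      # (frame index, patch index)


def patch_index(x: float, y: float, patch: int, grid_w: int) -> int:
    """Row-major index of the patch that contains pixel (x, y)."""
    return int(y // patch) * grid_w + int(x // patch)


def trajectory_neighbors(
    tracks: List[List[Point]], patch: int, grid_w: int
) -> Dict[Token, Set[Token]]:
    """For each patch token, collect tokens in other frames on a shared track.

    tracks[k][t] is the (x, y) position of tracked point k in frame t
    (assumed layout; a tracker such as CoTracker would supply these positions).
    """
    neighbors: Dict[Token, Set[Token]] = defaultdict(set)
    for track in tracks:
        tokens = [
            (t, patch_index(x, y, patch, grid_w))
            for t, (x, y) in enumerate(track)
        ]
        for a in tokens:
            for b in tokens:
                if a[0] != b[0]:  # temporal attention: other frames only
                    neighbors[a].add(b)
    return dict(neighbors)


# One point moving right across three 64x64 frames split into 16-pixel patches.
tracks = [[(8.0, 8.0), (24.0, 8.0), (40.0, 8.0)]]
pattern = trajectory_neighbors(tracks, patch=16, grid_w=4)
print(sorted(pattern[(0, 0)]))  # [(1, 1), (2, 2)]
```

In a full model, a pattern like this would typically become an attention mask, so that each token's temporal attention follows the tracked motion instead of comparing fixed patch positions across frames.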

[Figure: TRAVL overview]


BibTeX

@article{motamed2025travl,
  title={TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility},
  author={Saman Motamed and Minghao Chen and Luc Van Gool and Iro Laina},
  year={2025},
  eprint={2510.07550},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}