TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

Saman Motamed¹,², Minghao Chen², Luc Van Gool¹, Iro Laina²

1: INSAIT, Sofia University "St. Kliment Ohridski", Bulgaria • 2: Visual Geometry Group, University of Oxford


TL;DR

👨🏽‍🍳 Recipe: spatial attention + trajectory‑guided temporal attention on frozen VLM backbones.
🗂️ Training set: 3,482 videos with 19,708 physics‑focused Q/A pairs (real & implausible).
🧪 ImplausiBench: 300 paired clips (real vs. implausible) with grounded MCQs; scored by humans and an LLM judge.
📈 Outcome: stronger physics‑implausibility judgments without modifying the vision/LLM backbones.

ImplausiBench Leaderboard 🏆

Accuracies (%) on Implausible (generated) and Real subsets (150 videos each). We report both Human and LLM‑judge scores.

Model	Implausible — Human	Implausible — LLM	Real — Human	Real — LLM
LLaVA‑NeXT (TRAVL)	52.7	28.7	47.3	31.3
Gemini 2.5 Pro	41.3	29.3	100.0	78.0
LLaVA‑NeXT (SFT)	34.0	22.0	45.3	23.3
GPT‑4o	32.7	21.3	84.7	64.0
Qwen2.5‑VL	18.7	12.0	96.7	74.7
InternVideo 2.5	12.7	4.7	92.7	76.0
LLaVA‑NeXT (pretrained)	3.3	2.7	98.7	62.7

Human‑judge scores are treated as the gold standard; LLM‑judge scores are reported alongside them for every model.
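Because each subset holds 150 videos, every percentage in the table corresponds to a count of correct judgments out of 150. The sketch below shows that arithmetic. The function name `accuracy_pct` is illustrative, and the example counts are back-computed from the reported percentages rather than taken from the source.

```python
def accuracy_pct(num_correct: int, num_videos: int = 150) -> float:
    """Accuracy as a percentage, rounded to one decimal place like the table."""
    if not 0 <= num_correct <= num_videos:
        raise ValueError("num_correct must lie in [0, num_videos]")
    return round(100.0 * num_correct / num_videos, 1)


# Hypothetical counts, chosen because they reproduce entries in the table.
print(accuracy_pct(79))   # 52.7  (LLaVA-NeXT (TRAVL), Implausible, Human)
print(accuracy_pct(150))  # 100.0 (Gemini 2.5 Pro, Real, Human)
print(accuracy_pct(5))    # 3.3   (LLaVA-NeXT (pretrained), Implausible, Human)
```

One correct answer is worth about 0.67 percentage points on either subset, which is worth keeping in mind when comparing rows that differ by a point or two.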

ImplausiBench Example 📹

Each scenario has a plausible (real) clip and a matched implausible (generated) counterpart sharing the first frame & style.

[Video pair: Real clip and its Implausible counterpart]

Multiple‑Choice QA ⁉️

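Questions are adversarially constructed to remove language‑only shortcuts: the same question and options accompany both clips of a pair, so the correct answer depends on the video itself. On the interactive page, hovering over a video reveals its correct answer.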

Q: Do the events in the video appear to follow physics principles (Real) or not (Implausible)? Why?

A. Real, because the marble hits and topples the first domino, triggering a chain reaction that causes all of them to fall.

B. Implausible, because the marble passes through the dominoes without affecting them.

C. Real, because the marble rolls past the dominoes without hitting them, so they remain upright as expected.

D. Implausible, because the dominoes hover above the marble, which defies gravity and real-world physics.

E. Real, because the stacked-up dominoes fall realistically when struck by the rolling marble.

F. Implausible, because the marble changes direction abruptly and the dominoes move without being touched, which defies physical causality.

G. None of the given reasons is entirely correct.
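One way to represent such a paired item and grade a model's free‑text reply is sketched below. The names `PairedMCQ`, `extract_choice`, and `is_correct` are hypothetical. The answer keys in the example are placeholders, since the text above does not state which option is correct for each clip. The sketch also assumes the model is prompted to begin its reply with an option letter.

```python
import re
from dataclasses import dataclass

# Matches a reply that starts with an option letter, e.g. "B.", "(B)", "B: ...".
_CHOICE = re.compile(r"^\s*\(?([A-G])\)?(?:[.:)\s]|$)")


@dataclass(frozen=True)
class PairedMCQ:
    question: str
    options: dict            # option letter -> option text (shared by both clips)
    answer_real: str         # correct letter when the real clip is shown
    answer_implausible: str  # correct letter when the implausible clip is shown


def extract_choice(response: str):
    """Return the leading option letter of a reply, or None if there is none."""
    match = _CHOICE.match(response)
    return match.group(1) if match else None


def is_correct(item: PairedMCQ, clip_type: str, response: str) -> bool:
    """Grade a reply against the key for the clip that was actually shown."""
    if clip_type not in ("real", "implausible"):
        raise ValueError("clip_type must be 'real' or 'implausible'")
    key = item.answer_real if clip_type == "real" else item.answer_implausible
    return extract_choice(response) == key


item = PairedMCQ(
    question="Do the events in the video appear to follow physics principles?",
    options={letter: "..." for letter in "ABCDEFG"},  # option texts elided
    answer_real="A",         # placeholder key, not stated in the source
    answer_implausible="B",  # placeholder key, not stated in the source
)

print(is_correct(item, "real", "A. Real, because the marble topples the dominoes."))
print(is_correct(item, "implausible", "A. Real, because the marble topples the dominoes."))
```

Because both clips share one option list, a model that always answers "Real" can be right on at most one clip of each pair, which is the point of the paired design.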

TRAVL Training Dataset 🗃️

This dataset contains 3,482 videos and 19,708 QA pairs spanning real and implausible clips from multiple sources, including Physics‑IQ, Impossible Videos, and Video‑ChatGPT. The interactive page shows three short example clips, each with its associated question and answer.

Method: TRAVL

TRAVL adds spatial self‑attention within frames and trajectory‑guided temporal attention across frames, with trajectories taken from CoTracker. The vision and LLM backbones stay frozen; only the added attention modules and the projector are trained.

Intra‑frame spatial attention (structure, deformation, overlaps)
Trajectory‑aware temporal attention (continuity, persistence, teleportation detection)
Chunked temporal windows for dense‑token setups (e.g., LLaVA-NeXT)
Lightweight: no changes to vision encoder / language model
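The components listed above can be illustrated with attention masks over patch tokens. The NumPy sketch below is one plausible reading, not the authors' implementation. The mask construction, the mapping from track points to patches, the non‑overlapping chunking rule, and all function names are assumptions. Tracks are assumed to be already available as an array of shape (N, T, 2) holding (x, y) pixel positions from a point tracker.

```python
import numpy as np


def spatial_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean (T*P, T*P) mask: a token attends only to tokens of its own frame."""
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] == frame_id[None, :]


def trajectory_mask(tracks: np.ndarray, grid: int, patch: int, window=None) -> np.ndarray:
    """Tokens whose patches lie on the same track may attend to each other.

    tracks: (N, T, 2) array of (x, y) pixel positions (assumed tracker output).
    window: if set, attention is further limited to non-overlapping chunks of
            `window` frames (an assumed form of the chunked temporal windows).
    """
    _, num_frames, _ = tracks.shape
    tokens_per_frame = grid * grid
    # Map each tracked point to the index of the patch it falls into.
    cols = np.clip(tracks[..., 0] // patch, 0, grid - 1).astype(int)
    rows = np.clip(tracks[..., 1] // patch, 0, grid - 1).astype(int)
    token_ids = np.arange(num_frames)[None, :] * tokens_per_frame + rows * grid + cols
    total = num_frames * tokens_per_frame
    mask = np.eye(total, dtype=bool)  # self-attention keeps every row non-empty
    for ids in token_ids:             # connect all tokens along one track
        mask[np.ix_(ids, ids)] = True
    if window is not None:
        chunk_id = np.repeat(np.arange(num_frames) // window, tokens_per_frame)
        mask &= chunk_id[:, None] == chunk_id[None, :]
    return mask


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention where masked-out pairs receive zero weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


T, G, PATCH, D = 4, 2, 8, 16  # toy sizes: 4 frames, 2x2 patches of 8 px, 16-dim tokens
rng = np.random.default_rng(0)
x = rng.standard_normal((T * G * G, D))
# One hypothetical track drifting from the top-left to the bottom-right patch.
tracks = np.array([[[2, 2], [10, 2], [10, 10], [12, 12]]], dtype=float)

s_mask = spatial_mask(T, G * G)
t_mask = trajectory_mask(tracks, grid=G, patch=PATCH, window=2)
out = masked_attention(x, x, x, t_mask)
print(out.shape)
```

In this toy setup a token that no track passes through attends only to itself in the temporal step, so its representation is unchanged. In a trainable module the masks would gate learned query, key, and value projections, while the backbones stay frozen as described above.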

Podcast 📻

On the go? Listen to a short walkthrough of the ideas and findings.

BibTeX

@article{motamed2025travl,
  title={TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility},
  author={Saman Motamed and Minghao Chen and Luc Van Gool and Iro Laina},
  year={2025},
  eprint={2510.07550},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}