Accuracies (%) on Implausible (generated) and Real subsets (150 videos each). We report both Human and LLM‑judge scores.
Model | Implausible — Human | Implausible — LLM | Real — Human | Real — LLM |
---|---|---|---|---|
LLaVA‑NeXT (TRAVL) | 52.7 | 28.7 | 47.3 | 31.3 |
Gemini 2.5 Pro | 41.3 | 29.3 | 100.0 | 78.0 |
LLaVA‑NeXT (SFT) | 34.0 | 22.0 | 45.3 | 23.3 |
GPT‑4o | 32.7 | 21.3 | 84.7 | 64.0 |
Qwen2.5‑VL | 18.7 | 12.0 | 96.7 | 74.7 |
InternVideo 2.5 | 12.7 | 4.7 | 92.7 | 76.0 |
LLaVA‑NeXT (pretrained) | 3.3 | 2.7 | 98.7 | 62.7 |
We report both Human‑judge and LLM‑judge scores for all models; the Human judge serves as the gold standard.
Each scenario has a plausible (real) clip and a matched implausible (generated) counterpart sharing the first frame & style.
Questions are adversarially constructed to remove language‑only shortcuts. Hover over the videos above to see the correct answer for each clip.
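To make the pairing concrete, one scenario could be stored as a small record like the sketch below. This is illustrative only; the field names and per‑clip answers are assumptions for exposition, not the released annotation schema.

```python
from dataclasses import dataclass


@dataclass
class ScenarioPair:
    """Hypothetical layout of one paired scenario (illustrative, not the released schema)."""
    scenario_id: str
    real_clip: str           # path to the plausible (real) video
    implausible_clip: str    # matched generated clip sharing the first frame and style
    question: str            # adversarially constructed; not answerable from text alone
    answer_real: str         # correct answer when grounded in the real clip
    answer_implausible: str  # correct answer when grounded in the implausible clip
```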
This dataset contains 3,482 videos and 19,708 QA pairs spanning real and implausible clips from multiple sources, including Physics‑IQ, Impossible Videos, and Video‑ChatGPT. Below are three short examples; hover to reveal the associated question and answer.
The model adds spatial self‑attention within frames and trajectory‑guided temporal attention across frames, driven by CoTracker trajectories. The vision encoder and LLM backbones remain frozen; only the added attention modules and the projector are trained (a code sketch follows below).
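For concreteness, here is a minimal PyTorch sketch of the two attention stages described above; it is not the released TRAVL implementation. It assumes patch tokens of shape `[B, T, N, D]` and a precomputed `traj_idx` tensor that maps each CoTracker track to one patch index per frame; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn


class TrajectoryGuidedBlock(nn.Module):
    """Spatial attention within frames, then temporal attention along point tracks."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, traj_idx: torch.Tensor) -> torch.Tensor:
        # x:        [B, T, N, D] patch tokens (T frames, N patches per frame)
        # traj_idx: [B, T, P] long tensor; traj_idx[b, t, p] is the patch index
        #           that track p occupies in frame t (derived from CoTracker).
        B, T, N, D = x.shape
        P = traj_idx.shape[-1]

        # 1) Spatial self-attention: each token attends only within its own frame.
        xs = self.norm1(x).reshape(B * T, N, D)
        xs, _ = self.spatial_attn(xs, xs, xs, need_weights=False)
        x = x + xs.reshape(B, T, N, D)

        # 2) Gather the token each track passes through in every frame: [B, T, P, D].
        idx = traj_idx.unsqueeze(-1).expand(B, T, P, D)
        tracks = torch.gather(self.norm2(x), dim=2, index=idx)

        # Temporal attention over the T steps of each trajectory.
        tt = tracks.permute(0, 2, 1, 3).reshape(B * P, T, D)
        tt, _ = self.temporal_attn(tt, tt, tt, need_weights=False)
        upd = tt.reshape(B, P, T, D).permute(0, 2, 1, 3).contiguous()  # [B, T, P, D]

        # 3) Scatter the updated trajectory tokens back onto the frame grid (residual).
        out = x.clone()
        out.scatter_add_(dim=2, index=idx, src=upd)
        return out
```

In a training loop matching the frozen‑backbone setup, the vision tower and LLM parameters would be set to `requires_grad_(False)`, with only this block and the projector passed to the optimizer.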
On the go? Listen to a short walkthrough of the ideas and findings.