The BLUR Leaderboard

Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning

Dataset: Link; Paper: Link

| Model / System | QT | QF | HE | HM | HH | Overall |
|---|---|---|---|---|---|---|
| **Foundation Models** | | | | | | |
| Llama-3.1-405B | 0.34 | 0.17° | 0.35 | 0.32 | 0.25 | 0.30 |
| claude-3-5-sonnet-20241022 | 0.44 | 0.28 | 0.42 | 0.42 | 0.36 | 0.40 |
| gpt-4o-2024-11-20 | 0.42 | 0.28 | 0.39 | 0.43 | 0.35 | 0.38 |
| o1-2024-12-17 | 0.54 | 0.36 | 0.56 | 0.52 | 0.44 | 0.49 |
| DeepSeek-R1 | 0.45 | 0.27° | 0.46 | 0.44 | 0.35 | 0.41 |
| **AI Assistants** | | | | | | |
| Microsoft Copilot | 0.29 | 0.23 | 0.29 | 0.32 | 0.22 | 0.27 |
| Mistral Le Chat | 0.40 | 0.27 | 0.47 | 0.38 | 0.32 | 0.37 |
| Perplexity Pro Search | 0.31 | 0.15 | 0.29 | 0.29 | 0.24 | 0.27 |
| ChatGPT-4o | 0.53 | 0.36 | 0.60 | 0.52 | 0.41 | 0.49 |
| **Agentic Systems** | | | | | | |
| HuggingFace Agents + Claude 3.5 Sonnet | 0.61 | 0.41 | 0.60 | 0.56 | 0.54 | 0.56 |
| DynaSaur + GPT-4o | 0.58 | 0.27 | 0.61 | 0.52 | 0.44 | 0.50 |
| Operator | 0.57 | 0.46 | 0.56 | 0.56 | 0.52 | 0.54 |
| **Baselines** | | | | | | |
| Search Engine | 0.05 | 0.03 | 0.08 | 0.05 | 0.02 | 0.04 |
| Human | 0.98 | 1.00 | 0.98 | 0.98 | 0.99 | 0.98 |

Table 1: System and model performance on the BLUR benchmark. QT and QF denote performance on text-only queries and queries with file inputs, respectively. File-input support is marked in the QF column: ° signifies that the system does not support file uploads, • denotes partial support (only certain file extensions), and the absence of a marker means all file types are supported. HE, HM, and HH denote performance on the easy, medium, and hard query-difficulty subsets, respectively.
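As a quick way to compare entries across categories, the Overall column from Table 1 can be loaded and ranked programmatically. This is a minimal sketch, not part of any official BLUR tooling; the scores are copied verbatim from the table above.

```python
# Overall scores copied from Table 1 of the BLUR leaderboard.
# This snippet only ranks the reported numbers; it does not re-evaluate anything.
overall = {
    "Llama-3.1-405B": 0.30,
    "claude-3-5-sonnet-20241022": 0.40,
    "gpt-4o-2024-11-20": 0.38,
    "o1-2024-12-17": 0.49,
    "DeepSeek-R1": 0.41,
    "Microsoft Copilot": 0.27,
    "Mistral Le Chat": 0.37,
    "Perplexity Pro Search": 0.27,
    "ChatGPT-4o": 0.49,
    "HuggingFace Agents + Claude 3.5 Sonnet": 0.56,
    "DynaSaur + GPT-4o": 0.50,
    "Operator": 0.54,
    "Search Engine": 0.04,  # baseline
    "Human": 0.98,          # baseline
}

# Sort systems from best to worst Overall score.
ranking = sorted(overall.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{score:.2f}  {name}")
```

Running this confirms the pattern discussed in the paper: the human baseline sits far above every automated system, and agentic systems lead among the machine entries.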