Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
Model / System | QT | QF | HE | HM | HH | Overall |
---|---|---|---|---|---|---|
Foundation Models | | | | | | |
Llama-3.1-405B | 0.34 | 0.17° | 0.35 | 0.32 | 0.25 | 0.30 |
claude-3-5-sonnet-20241022 | 0.44 | 0.28• | 0.42 | 0.42 | 0.36 | 0.40 |
gpt-4o-2024-11-20 | 0.42 | 0.28• | 0.39 | 0.43 | 0.35 | 0.38 |
o1-2024-12-17 | 0.54 | 0.36• | 0.56 | 0.52 | 0.44 | 0.49 |
DeepSeek-R1 | 0.45 | 0.27° | 0.46 | 0.44 | 0.35 | 0.41 |
AI Assistants | | | | | | |
Microsoft Copilot | 0.29 | 0.23• | 0.29 | 0.32 | 0.22 | 0.27 |
Mistral Le Chat | 0.40 | 0.27• | 0.47 | 0.38 | 0.32 | 0.37 |
Perplexity Pro Search | 0.31 | 0.15• | 0.29 | 0.29 | 0.24 | 0.27 |
ChatGPT-4o | 0.53 | 0.36 | 0.60 | 0.52 | 0.41 | 0.49 |
Agentic Systems | | | | | | |
HuggingFace Agents + Claude 3.5 Sonnet | 0.61 | 0.41• | 0.60 | 0.56 | 0.54 | 0.56 |
DynaSaur + GPT-4o | 0.58 | 0.27 | 0.61 | 0.52 | 0.44 | 0.50 |
Operator | 0.57 | 0.46• | 0.56 | 0.56 | 0.52 | 0.54 |
Baselines | | | | | | |
Search Engine | 0.05 | 0.03• | 0.08 | 0.05 | 0.02 | 0.04 |
Human | 0.98 | 1.00 | 0.98 | 0.98 | 0.99 | 0.98 |
Table 1: System and model performance on the BLUR benchmark. QT and QF denote performance on text-only queries and on queries with file inputs, respectively. Markers on the QF scores indicate file-input support: ° means the system does not support file uploads, • means it supports only certain file types, and no marker means all file types are supported. HE, HM, and HH denote performance on the easy, medium, and hard query-difficulty subsets, respectively.
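For readers who want to work with the reported numbers programmatically, the sketch below parses rows of Table 1 into a structured form, decoding the ° / • markers on the QF scores as described in the caption. This is a minimal illustration with a hypothetical row format, not part of the benchmark's official tooling.

```python
import re

# A few rows copied verbatim from Table 1; the "°"/"•" marker attached to
# the QF score encodes file-upload support, per the caption.
ROWS = """\
Llama-3.1-405B | 0.34 | 0.17° | 0.35 | 0.32 | 0.25 | 0.30
claude-3-5-sonnet-20241022 | 0.44 | 0.28• | 0.42 | 0.42 | 0.36 | 0.40
HuggingFace Agents + Claude 3.5 Sonnet | 0.61 | 0.41• | 0.60 | 0.56 | 0.54 | 0.56
Human | 0.98 | 1.00 | 0.98 | 0.98 | 0.99 | 0.98
""".splitlines()

COLUMNS = ["QT", "QF", "HE", "HM", "HH", "Overall"]
SUPPORT = {"°": "none", "•": "partial", "": "full"}  # marker -> file support

def parse_row(line: str) -> dict:
    """Split one pipe-delimited table row into a record of float scores."""
    name, *cells = [c.strip() for c in line.split("|")]
    record = {"system": name}
    for col, cell in zip(COLUMNS, cells):
        m = re.fullmatch(r"([0-9.]+)([°•]?)", cell)
        record[col] = float(m.group(1))
        if col == "QF":  # the support marker only appears on the QF column
            record["file_support"] = SUPPORT[m.group(2)]
    return record

results = [parse_row(line) for line in ROWS]
for r in results:
    print(f"{r['system']:<42} overall={r['Overall']:.2f} files={r['file_support']}")
```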