Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
Model / System | QT | QF | HE | HM | HH | Overall |
---|---|---|---|---|---|---|
Foundation Models | | | | | | |
Llama-3.1-405B | 0.34 | 0.17° | 0.35 | 0.32 | 0.25 | 0.30 |
claude-3-5-sonnet-20241022 | 0.44 | 0.28• | 0.42 | 0.42 | 0.36 | 0.40 |
gpt-4o-2024-11-20 | 0.42 | 0.28• | 0.39 | 0.43 | 0.35 | 0.38 |
o1-2024-12-17 | 0.54 | 0.36• | 0.56 | 0.52 | 0.44 | 0.49 |
DeepSeek-R1 | 0.45 | 0.27° | 0.46 | 0.44 | 0.35 | 0.41 |
AI Assistants | | | | | | |
Microsoft Copilot | 0.29 | 0.23• | 0.29 | 0.32 | 0.22 | 0.27 |
Mistral Le Chat | 0.40 | 0.27• | 0.47 | 0.38 | 0.32 | 0.37 |
Perplexity Pro Search | 0.31 | 0.15• | 0.29 | 0.29 | 0.24 | 0.27 |
ChatGPT-4o | 0.53 | 0.36 | 0.60 | 0.52 | 0.41 | 0.49 |
Agentic Systems | | | | | | |
HuggingFace Agents + Claude 3.5 Sonnet | 0.61 | 0.41• | 0.60 | 0.56 | 0.54 | 0.56 |
DynaSaur + GPT-4o | 0.58 | 0.27 | 0.61 | 0.52 | 0.44 | 0.50 |
Operator | 0.57 | 0.46• | 0.56 | 0.56 | 0.52 | 0.54 |
Baselines | | | | | | |
Search Engine | 0.05 | 0.03• | 0.08 | 0.05 | 0.02 | 0.04 |
Human | 0.98 | 1.00 | 0.98 | 0.98 | 0.99 | 0.98 |
Table 1: System and model performance on the BLUR benchmark. QT and QF denote performance on text-only queries and on queries with file inputs, respectively. Markers on the QF scores indicate file-input support: ° means the system does not support file uploads, • means it supports only certain file types, and no marker means all file types are supported. HE, HM, and HH denote performance on the easy, medium, and hard query-difficulty subsets, respectively.
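For readers who want to work with the reported numbers programmatically, the sketch below parses rows of Table 1 into a structured form, decoding the ° / • markers on the QF scores as described in the caption. This is a minimal illustration with a hypothetical row format, not part of the benchmark's official tooling.

```python
import re

# A few rows copied verbatim from Table 1; the "°"/"•" marker attached to
# the QF score encodes file-upload support, per the caption.
ROWS = """\
Llama-3.1-405B | 0.34 | 0.17° | 0.35 | 0.32 | 0.25 | 0.30
claude-3-5-sonnet-20241022 | 0.44 | 0.28• | 0.42 | 0.42 | 0.36 | 0.40
HuggingFace Agents + Claude 3.5 Sonnet | 0.61 | 0.41• | 0.60 | 0.56 | 0.54 | 0.56
Human | 0.98 | 1.00 | 0.98 | 0.98 | 0.99 | 0.98
""".splitlines()

COLUMNS = ["QT", "QF", "HE", "HM", "HH", "Overall"]
SUPPORT = {"°": "none", "•": "partial", "": "full"}  # marker -> file support

def parse_row(line: str) -> dict:
    """Split one pipe-delimited table row into a record of float scores."""
    name, *cells = [c.strip() for c in line.split("|")]
    record = {"system": name}
    for col, cell in zip(COLUMNS, cells):
        m = re.fullmatch(r"([0-9.]+)([°•]?)", cell)
        record[col] = float(m.group(1))
        if col == "QF":  # the support marker only appears on the QF column
            record["file_support"] = SUPPORT[m.group(2)]
    return record

results = [parse_row(line) for line in ROWS]
for r in results:
    print(f"{r['system']:<42} overall={r['Overall']:.2f} files={r['file_support']}")
```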