AbsenceBench is a benchmark evaluating LLMs’ ability to identify deliberately removed content in documents.
Tests span numerical sequences, poetry, and GitHub pull requests; in each domain, models receive both the original and an edited version and must identify what was omitted, as sketched below.
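A minimal sketch of how such an instance could be constructed (the function and parameter names are illustrative assumptions, not taken from the benchmark's actual code):

```python
import random

def make_absence_instance(original: str, removal_rate: float = 0.1, seed: int = 0):
    """Build an AbsenceBench-style instance by deleting random lines.

    Returns (original, edited, removed): the untouched text, a copy with
    some lines deleted, and the ground-truth list of deleted lines.
    """
    rng = random.Random(seed)
    lines = original.splitlines()
    drop = {i for i in range(len(lines)) if rng.random() < removal_rate}
    # Guarantee at least one deletion so every instance has an answer.
    if lines and not drop:
        drop = {rng.randrange(len(lines))}
    edited = "\n".join(ln for i, ln in enumerate(lines) if i not in drop)
    removed = [lines[i] for i in sorted(drop)]
    return original, edited, removed
```

The model is then prompted with both versions, and the omissions it predicts are scored against the ground-truth `removed` list.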
State-of-the-art models (e.g., Claude-3.7-Sonnet) achieve only a 69.6% F1-score on tasks with contexts of roughly 5K tokens.
The authors hypothesize that Transformer attention mechanisms struggle to detect gaps because missing content produces no tokens, and hence no keys, for the model to attend to.
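Notably, the omissions are trivial to recover with explicit string matching, since a diff names exactly the lines that vanished; a minimal sketch using Python's standard difflib:

```python
import difflib

def find_removed_lines(original: str, edited: str) -> list[str]:
    """Return lines that appear in the original but not in the edited copy."""
    diff = difflib.ndiff(original.splitlines(), edited.splitlines())
    # ndiff prefixes lines unique to the first sequence with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]
```

The diff algorithm has explicit positional access to both texts, whereas an attention-based model must infer absence purely from the tokens that are present.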
The result highlights a stark contrast: models with superhuman performance on retrieval tasks such as needle-in-a-haystack break down unexpectedly when asked to detect omissions.