AbsenceBench is a benchmark evaluating LLMs’ ability to identify deliberately removed content in documents.
Tests span numerical sequences, poetry, and GitHub pull requests; in each domain, models receive both the original and an edited version and must identify what was omitted, as sketched below.
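A minimal sketch of how such an instance could be constructed (the function and parameter names are illustrative assumptions, not taken from the benchmark's actual code):

```python
import random

def make_absence_instance(original: str, removal_rate: float = 0.1, seed: int = 0):
    """Build an AbsenceBench-style instance by deleting random lines.

    Returns (original, edited, removed): the untouched text, a copy with
    some lines deleted, and the ground-truth list of deleted lines.
    """
    rng = random.Random(seed)
    lines = original.splitlines()
    drop = {i for i in range(len(lines)) if rng.random() < removal_rate}
    # Guarantee at least one deletion so every instance has an answer.
    if lines and not drop:
        drop = {rng.randrange(len(lines))}
    edited = "\n".join(ln for i, ln in enumerate(lines) if i not in drop)
    removed = [lines[i] for i in sorted(drop)]
    return original, edited, removed
```

The model is then prompted with both versions, and the omissions it predicts are scored against the ground-truth `removed` list.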
State-of-the-art models (e.g., Claude-3.7-Sonnet) achieve only a 69.6% F1-score on tasks with contexts of roughly 5K tokens.
The authors hypothesize that Transformer attention mechanisms struggle to detect gaps because missing content produces no tokens, and hence no keys, for the model to attend to.
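Notably, the omissions are trivial to recover with explicit string matching, since a diff names exactly the lines that vanished; a minimal sketch using Python's standard difflib:

```python
import difflib

def find_removed_lines(original: str, edited: str) -> list[str]:
    """Return lines that appear in the original but not in the edited copy."""
    diff = difflib.ndiff(original.splitlines(), edited.splitlines())
    # ndiff prefixes lines unique to the first sequence with "- ".
    return [line[2:] for line in diff if line.startswith("- ")]
```

The diff algorithm has explicit positional access to both texts, whereas an attention-based model must infer absence purely from the tokens that are present.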
The result highlights a stark contrast: models with superhuman performance on retrieval tasks such as needle-in-a-haystack break down unexpectedly when asked to detect omissions.