A Better Newspaper

## N-Day-Bench – LLM Vulnerability Detection Benchmark

### Overview

N-Day-Bench is a benchmark testing whether frontier large language models (LLMs) can autonomously discover known security vulnerabilities in real-world codebases (winfunc.com). The project addresses a recognized limitation of static vulnerability benchmarks: rapid contamination of training data causes models to memorize rather than reason.

### Methodology

- **Monthly refresh**: New cases are pulled each month from GitHub security advisories (winfunc.com)
- **Realistic conditions**: Repos are checked out at the last commit *before* the patch, giving models access to vulnerable code without the fix
- **Sandboxed execution**: Models are given a bash shell to actively explore the codebase
- **Multi-agent architecture**: Each case reportedly runs three agents: a Curator plus additional roles (winfunc.com)

### Why It Matters

**Contamination problem**: Static benchmarks become unreliable as cases leak into LLM training data. N-Day-Bench's rolling refresh attempts to keep the evaluation set ahead of contamination, or at minimum makes the contamination window transparent (winfunc.com).

**Legal & security implications**: As LLMs are increasingly deployed in security workflows, reliable benchmarks for vulnerability detection capability are essential for:

- Procurement decisions by security teams
- Liability analysis when AI-assisted security review misses known vulnerabilities
- Regulatory compliance assessments

### Strategic Relevance

- Directly relevant to the emerging AI cybersecurity tool market
- Benchmarks of this type will likely be cited in procurement RFPs, insurance underwriting (cf. Cowbell Prime One), and potentially litigation over AI-assisted security failures
- Connects to the broader debate over the reliability of LLM capability measurement

### Connections

- Relevant to OpenAI GPT-5.4-Cyber, Anthropic Claude Code, and AI-enabled autonomous cyberattack narratives
- The GitHub security advisory ecosystem is the central data source
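
To make the methodology concrete, here is a minimal Python sketch of two ideas described above: checking a repository out at the last commit before the patch, and keeping only cases published after a model's training cutoff. It is an illustration under stated assumptions, not N-Day-Bench's actual code. The `Advisory` record, both function names, the example URL, the commit hash, and the cutoff date are all hypothetical. The sketch also assumes the fix lands in a single, known patch commit.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Advisory:
    """Hypothetical record for one advisory-derived benchmark case."""
    advisory_id: str   # placeholder ID, not a real advisory
    repo_url: str      # placeholder repository URL
    patch_commit: str  # commit assumed to contain the fix
    published: date    # advisory publication date


def pre_patch_checkout_commands(case: Advisory, workdir: str) -> list[list[str]]:
    """Build git commands that leave `workdir` at the parent of the patch commit."""
    return [
        ["git", "clone", case.repo_url, workdir],
        # `<sha>^` names the first parent of a commit, i.e. the tree just before the fix.
        ["git", "-C", workdir, "checkout", "--detach", f"{case.patch_commit}^"],
    ]


def cases_after_cutoff(cases: list[Advisory], training_cutoff: date) -> list[Advisory]:
    """Keep only cases published strictly after an assumed training cutoff."""
    return [c for c in cases if c.published > training_cutoff]


if __name__ == "__main__":
    # All values below are made-up placeholders for illustration.
    cases = [
        Advisory("EXAMPLE-0001", "https://example.com/org/old.git", "1111111", date(2025, 1, 10)),
        Advisory("EXAMPLE-0002", "https://example.com/org/new.git", "2222222", date(2025, 7, 15)),
    ]
    for case in cases_after_cutoff(cases, training_cutoff=date(2025, 6, 1)):
        for cmd in pre_patch_checkout_commands(case, f"/tmp/{case.advisory_id}"):
            print(" ".join(cmd))
```

The sketch only builds the command lists, so it can be inspected without network access. A real harness would run them inside the sandbox, for example with `subprocess.run(cmd, check=True)`. For a merge commit, `^` selects the first parent only, so such cases would need separate handling.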