Entity
N-Day-Bench – LLM Vulnerability Detection Benchmark
N-Day-Bench is a monthly-refreshing benchmark that tests whether frontier LLMs can autonomously find known security vulnerabilities in real codebases, using GitHub security advisories and sandboxed execution environments. Its rolling refresh design attempts to combat training data contamination that plagues static security benchmarks.
Importance: 65% | Confidence: 78% | Mentions: 1 | Updated: April 27, 2026
## N-Day-Bench – LLM Vulnerability Detection Benchmark
### Overview
N-Day-Bench is a benchmark that tests whether frontier large language models (LLMs) can autonomously discover known security vulnerabilities in real-world codebases (winfunc.com). The project addresses a recognized limitation of static vulnerability benchmarks: once their cases leak into training data, a high score may reflect memorization rather than reasoning.
### Methodology
- **Monthly refresh**: New cases are pulled each month from GitHub security advisories (winfunc.com)
- **Realistic conditions**: Repos are checked out at the last commit *before* the patch, giving models access to vulnerable code without the fix
- **Sandboxed execution**: Models are given a bash shell to actively explore the codebase
- **Multi-agent architecture**: Each case reportedly runs three agents: a Curator plus additional roles (winfunc.com)
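
The case-preparation steps above (select one month's advisories, then check out the repository just before the fix) can be sketched roughly as follows. This is a minimal illustration, not N-Day-Bench's actual code: the `Advisory` record, its field names, and both helper functions are hypothetical, and treating "the last commit before the patch" as the first parent of the fix commit (`<sha>^`) is an assumption that may not hold for merge commits or multi-commit fixes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Advisory:
    """Hypothetical record for one advisory; field names are assumptions."""
    advisory_id: str
    repo_url: str
    fix_commit: str  # SHA of the commit that patched the vulnerability
    published: date


def in_month(advisory: Advisory, year: int, month: int) -> bool:
    """Return True if the advisory was published in the given calendar month."""
    return (advisory.published.year, advisory.published.month) == (year, month)


def pre_patch_commands(advisory: Advisory, workdir: str) -> list[list[str]]:
    """Build git commands that leave `workdir` at the parent of the fix commit.

    `<sha>^` is git's revision syntax for the first parent of a commit, so the
    checked-out tree contains the vulnerable code but not the fix.
    """
    return [
        ["git", "clone", advisory.repo_url, workdir],
        ["git", "-C", workdir, "checkout", "--detach", f"{advisory.fix_commit}^"],
    ]
```

The functions only build argument lists, so a harness could pass each one to `subprocess.run(cmd, check=True)` inside the sandbox rather than on the host.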
### Why It Matters
**Contamination problem**: Static benchmarks become unreliable as cases leak into LLM training data. N-Day-Bench's rolling refresh attempts to keep the evaluation set ahead of contamination, or at minimum makes the contamination window transparent (winfunc.com).
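
One way to make the contamination window concrete is to compare each case's publication date with a model's stated training cutoff. The sketch below is an assumption about how such a check might look, not a description of N-Day-Bench's scoring; the function names are hypothetical, and a published cutoff date is only a rough proxy for what a model actually saw during training.

```python
from datetime import date


def is_potentially_contaminated(published: date, training_cutoff: date) -> bool:
    """A case published on or before the cutoff could appear in training data."""
    return published <= training_cutoff


def exposure_days(published: date, evaluated: date) -> int:
    """Number of days the advisory was public before the evaluation ran."""
    return (evaluated - published).days
```

Reporting `exposure_days` alongside each result would let a reader judge how fresh a case was at evaluation time, even when the training cutoff is unknown.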
**Legal & security implications**: As LLMs are increasingly deployed in security workflows, reliable benchmarks for vulnerability detection capability are essential for:
- Procurement decisions by security teams
- Liability analysis when AI-assisted security review misses known vulnerabilities
- Regulatory compliance assessments
### Strategic Relevance
- Directly relevant to the emerging AI cybersecurity tool market
- Benchmarks of this type will likely be cited in procurement RFPs, insurance underwriting (cf. Cowbell Prime One), and potentially litigation over AI-assisted security failures
- Connects to the broader debate over how reliably LLM capabilities can be measured
### Connections
- Relevant to OpenAI GPT-5.4-Cyber, Anthropic Claude Code, and AI-enabled autonomous cyberattack narratives
- The GitHub security advisory ecosystem is the central data source