A Better Newspaper

## Overview

Speculative decoding is an inference-time acceleration technique for large language models that uses a smaller or simplified 'draft' model to propose candidate token sequences, which are then verified in parallel by the full model. Multiple research threads, including KnapSpec (arXiv:2602.20217) and related self-speculative decoding (SSD) variants, are actively advancing the state of the art.

## KnapSpec Contribution

KnapSpec reformulates draft model selection as a **knapsack optimization problem**, maximizing tokens-per-time throughput by modeling hardware-specific latencies of Attention and MLP layers as functions of context length (arXiv:2602.20217). This training-free approach reportedly outperforms static layer-skipping heuristics in long-context scenarios, which are increasingly common in enterprise deployments.

## Broader Landscape

Speculative decoding variants include:

- **Standard speculative decoding:** Separate small draft model + large verifier
- **Self-speculative decoding (SSD):** Single model skips layers to generate drafts internally
- **Medusa/Hydra:** Multiple decoding heads
- **OASIS / LUT-based quantization:** Complementary approach via dual-side quantization (arXiv:2507.23035)

## Strategic Relevance

Inference cost is the primary economic bottleneck for deploying frontier LLMs at scale. Any technique that increases tokens/second without quality degradation directly reduces per-query costs and improves margin for API providers and self-hosted deployments. Speculative decoding is increasingly being integrated into production inference stacks (vLLM, TensorRT-LLM, SGLang).

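
To make the draft-then-verify loop from the Overview concrete, here is a minimal Python sketch under stated assumptions. The two "models" (`target_next` and `draft_next`) are hypothetical toy functions, not real LLMs, and the loop uses greedy decoding only. It is not KnapSpec's method or any library's implementation. The property it illustrates is that the output matches plain greedy decoding with the full model while needing fewer verification passes.

```python
from typing import Callable, List, Sequence, Tuple

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[Sequence[int]], int]

VOCAB = 50  # hypothetical toy vocabulary size


def target_next(ctx: Sequence[int]) -> int:
    """Toy stand-in for the full (expensive) model."""
    return (31 * sum(ctx) + 7 * len(ctx)) % VOCAB


def draft_next(ctx: Sequence[int]) -> int:
    """Toy stand-in for the cheap draft model: agrees with the target
    except when the context length is a multiple of 4."""
    guess = target_next(ctx)
    return (guess + 1) % VOCAB if len(ctx) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: Sequence[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: Sequence[int],
    n_new: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1) The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2) The target verifies the proposal. A real system would score all
        #    k positions in one batched forward pass; here we count one pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace first mismatch, stop
                break
            accepted.append(tok)
        else:
            # Every drafted token matched: the same pass yields a bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(target_next, draft_next, [1, 2, 3], 20)
    print(tokens == greedy_decode(target_next, [1, 2, 3], 20), passes)
```

Each verification pass emits at least one token, so the loop never needs more target passes than plain decoding; the gain depends on how often the draft agrees with the target. Production systems that sample rather than decode greedily typically replace the exact-match check with a probabilistic acceptance rule.
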
## Commercial Implications

- Inference infrastructure vendors (Groq, Cerebras, Fireworks) compete partly on this dimension
- Cloud providers (AWS, Azure, GCP) have incentives to license or acquire leading speculative decoding implementations
- Patent landscape around specific implementations is active and contested

## Connections

- LLM inference acceleration broadly
- OASIS dual-side quantization
- dLLM-Cache for diffusion LLMs

Speculative Decoding – LLM Inference Acceleration Technique Landscape