NEROX Engineering
April 2, 2026
The question of when QAOA outperforms classical heuristics is central to the near-term quantum advantage debate. Theoretical approximation ratio guarantees for QAOA at depth p are known for specific graph problems (Farhi et al., 2014; Wang et al., 2018), but empirical results on realistic instances under practical time budgets are sparse. This report presents a controlled comparison of four solvers on six standard benchmark instances across four problem classes, with fixed hardware and time limits.
Experimental setup
All runs were executed on a single NVIDIA A100 SXM4 80 GB (CUDA 12.1, driver 525.85.12). Solver configurations:

- GPU Parallel Annealing (GPU-SA): 4,096 chains, 20,000 sweeps per chain, geometric cooling schedule with β auto-calibrated from a 500-step random walk estimating the energy standard deviation.
- QAOA: cuStateVec statevector simulation, depth p = 4, L-BFGS-B parameter optimization with 200 function evaluations, 1,000 measurement shots per evaluation, 10 random restarts.
- Tabu Search: adaptive tenure ∈ [10, 30], 50,000 iterations, problem-specific neighborhood operators (2-opt for TSP, operation-swap for job shop).
- CPU SA: single-chain Metropolis-Hastings with the same total sweep count as GPU-SA (4,096 × 20,000 = 81.9M steps), geometric cooling.
All solvers use default configurations — no instance-specific tuning. Five independent trials per instance; median reported. Gap to optimal computed against known exact solutions or the Held-Karp lower bound (†) for TSP. QAOA applied only where statevector simulation is tractable (≤ 64 qubits).
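The β auto-calibration step described above can be sketched as follows. This is an illustrative reconstruction of the stated procedure (random walk, energy standard deviation, β₀ ≈ 1/σ_E), not the NEROX implementation; the function name and the 1/σ guard are assumptions.

```python
import numpy as np

def calibrate_beta(Q, n_steps=500, seed=0):
    """Estimate an initial inverse temperature for geometric cooling.

    Walks randomly over single bit-flips, records the energy trace, and
    sets beta_0 ~ 1/sigma_E so that typical uphill moves start out mostly
    accepted. Illustrative sketch, not the NEROX implementation.
    """
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.integers(0, 2, size=n).astype(float)
    energies = np.empty(n_steps)
    for t in range(n_steps):
        i = rng.integers(n)
        x[i] = 1.0 - x[i]            # accept every flip: unbiased random walk
        energies[t] = x @ Q @ x      # QUBO energy
    sigma = energies.std()
    return 1.0 / max(sigma, 1e-12)   # guard against flat energy landscapes
```

A 500-step walk is cheap relative to 20,000 sweeps per chain, so the calibration adds negligible overhead.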
Results
| Instance | Optimal | GPU-SA | QAOA p=4 | Tabu | CPU-SA | t (GPU) |
|---|---|---|---|---|---|---|
| eil51 (TSP, 51 cities) | 426 | 427 (0.23%) | n/a | 429 (0.70%) | 431 (1.17%) | 0.4s |
| pr1002 (TSP, 1002 cities) | 259,045† | 261,800 (1.06%) | n/a | n/a | 275,140 (6.21%) | 3.8s |
| G22 (MaxCut, 2000 nodes) | 13,359 | 13,291 (0.51%) | n/a | n/a | 12,980 (2.84%) | 9.1s |
| G01 (MaxCut, 800 nodes) | 11,624 | 11,568 (0.48%) | 11,040 (5.02%) | 11,490 (1.15%) | 11,200 (3.64%) | 1.2s |
| FT10 (JobShop, 10×10) | 930 | 962 (3.44%) | n/a | 935 (0.54%) | 1,041 (11.9%) | 1.1s |
| Portfolio-100 (QAP-like) | — | σ²=0.0031 | σ²=0.0038 | σ²=0.0041 | σ²=0.0089 | 0.2s |
Gap in parentheses vs known optimal or Held-Karp bound (†). n/a = instance exceeds QAOA simulation budget or problem structure incompatible. Portfolio uses portfolio variance σ² (lower is better); no known optimal.
Analysis
GPU-SA dominates on large instances. The advantage compounds with instance size: on eil51, GPU-SA is marginally better than Tabu (0.23% vs 0.70% gap); on pr1002, it is the only solver that produces a competitive result (1.06% gap) within the time budget, while CPU-SA lands 6.21% above the Held-Karp bound in equivalent wall-clock time. The explanation is straightforward: 4,096 independent restarts dramatically reduce the probability that every chain gets trapped in the same basin. At pr1002, the QUBO has ~1M non-zero entries in Q; the A100's memory bandwidth sustains high throughput per chain despite the large matrix.
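The restart argument can be made quantitative under the simplifying assumption that chains are independent and each escapes the dominant basin with some small per-chain probability (the 0.2% figure below is an arbitrary illustration, not a measurement):

```python
def p_any_chain_succeeds(p_single, n_chains):
    """Probability that at least one of n independent chains reaches the
    target basin, given per-chain success probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_chains

# Even a 0.2% per-chain success rate becomes near-certain across 4,096
# chains, while a single CPU chain stays at 0.2%:
print(round(p_any_chain_succeeds(0.002, 4096), 4))  # 0.9997
```

This is why the gap between GPU-SA and single-chain CPU-SA widens with instance size even at identical total sweep counts.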
Tabu Search is competitive on structured instances. On FT10 (job shop, 10 jobs × 10 machines, known optimal 930), Tabu Search at 0.54% gap substantially outperforms GPU-SA at 3.44%. The cause is neighborhood structure: the operation-swap move operator changes the relative ordering of two operations on the same machine, a semantically meaningful move that preserves feasibility while perturbing the makespan in small increments. Random bit-flips in annealing have no such structure: a single flip toggles one assignment variable in the QUBO encoding (e.g. one operation's time slot in a time-indexed formulation), which can invalidate multiple precedence constraints simultaneously and requires large energy penalties to repair. Tabu Search exploits scheduling structure that GPU-SA cannot access through the QUBO encoding.
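The operation-swap neighborhood described above can be sketched as a single move generator. This is an illustrative simplification (production operators typically restrict swaps to critical-path pairs); the data layout and function name are assumptions:

```python
import random

def operation_swap(machine_sequences, rng=None):
    """One tabu neighborhood move: swap two adjacent operations on a
    randomly chosen machine's processing sequence.

    machine_sequences: list of per-machine sequences, each a list of
    (job, op) tuples. Returns a new candidate; input is not mutated.
    Illustrative sketch, not the NEROX operator.
    """
    rng = rng or random.Random()
    seqs = [list(s) for s in machine_sequences]  # copy, don't mutate input
    candidates = [m for m, s in enumerate(seqs) if len(s) >= 2]
    if not candidates:
        return seqs
    m = rng.choice(candidates)
    i = rng.randrange(len(seqs[m]) - 1)
    seqs[m][i], seqs[m][i + 1] = seqs[m][i + 1], seqs[m][i]
    return seqs
```

Every candidate produced this way is a valid machine ordering, so the search never pays the constraint-repair cost that dominates the QUBO bit-flip neighborhood.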
QAOA is bounded by circuit simulation cost. QAOA at p=4 is not directly tractable for G01: the full 800-node MaxCut would require 800 qubits, far beyond statevector limits, so the reported run used a 64-node subgraph extracted by spectral sampling. On this subgraph, QAOA returns a 5.02% gap versus GPU-SA's 0.48% on the full graph. The comparison is not apples-to-apples: QAOA operates on a much easier subproblem. For the full G01 instance, QAOA is inapplicable at any current simulation scale. QAOA's theoretical guarantee for MaxCut at p=1 is a 0.6924 approximation ratio on 3-regular graphs (Farhi et al., 2014); empirically, p=4 on small dense instances approaches a 0.92–0.95 approximation ratio, which is competitive with Goemans-Williamson (0.878) but not with GPU-SA.
Portfolio variance minimization is a dense QUBO (every asset pair has a non-zero covariance entry), which favors GPU-SA — the matrix is fully loaded into HBM2e and throughput is maximal. QAOA shows higher variance across trials (σ² = 0.0038 vs 0.0031), reflecting sensitivity to variational parameter initialization.
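To see why this QUBO is dense, here is a sketch of the standard mean-variance encoding with a cardinality penalty. The function name, penalty form, and default values are illustrative assumptions, not the NEROX encoding:

```python
import numpy as np

def portfolio_qubo(cov, mu, risk_aversion=1.0, budget=10, penalty=10.0):
    """Mean-variance asset selection as a QUBO over x_i in {0, 1}.

    Objective: risk_aversion * x^T cov x - mu^T x
               + penalty * (sum_i x_i - budget)^2   (constant term dropped)

    Every asset pair contributes a covariance entry, so Q is fully dense.
    Illustrative encoding, not the NEROX one.
    """
    n = len(mu)
    Q = risk_aversion * np.asarray(cov, dtype=float).copy()
    Q[np.diag_indices(n)] -= np.asarray(mu, dtype=float)  # reward expected return
    # Expand penalty * (sum x_i - budget)^2: +penalty on every entry of Q,
    # plus an extra -2 * budget * penalty on the diagonal (x_i^2 == x_i).
    Q += penalty
    Q[np.diag_indices(n)] -= 2.0 * budget * penalty
    return Q
```

Because Q has no exploitable sparsity, per-chain throughput is limited by memory bandwidth rather than by irregular access patterns, which is the regime where the A100 is strongest.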
Implications for solver selection
General QUBO, any size
GPU-SA. Default. The only solver with competitive results across all tested instance classes.
Job shop / flow shop scheduling
Tabu Search with problem-specific move operators. GPU-SA degrades significantly due to constraint violation cost in QUBO encoding.
MaxCut / graph problems ≤ 64 nodes
QAOA at p ≥ 4 is worth benchmarking. Beyond 64 nodes, GPU-SA dominates under any practical time budget.
Very large QUBO (> 50,000 variables)
Hybrid Solver. GPU-SA memory is exhausted at ~100K variables on A100 (float16); Hybrid decomposition handles arbitrarily large instances at modest quality cost.
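The selection guidance above can be captured as a small dispatch helper. The thresholds come from this report; the solver labels are illustrative strings, not a real NEROX API:

```python
def pick_solver(problem_type, n_vars):
    """Heuristic solver dispatch mirroring the selection guidance above.
    Thresholds are from this report; names are labels, not a NEROX API."""
    if n_vars > 50_000:
        return "hybrid"      # decomposition; GPU-SA memory-bound past ~100K vars
    if problem_type in ("job_shop", "flow_shop"):
        return "tabu"        # structured move operators beat QUBO bit-flips
    if problem_type == "maxcut" and n_vars <= 64:
        return "qaoa"        # statevector simulation still tractable
    return "gpu_sa"          # default: competitive on every tested class

print(pick_solver("maxcut", 64))        # qaoa
print(pick_solver("job_shop", 100))     # tabu
print(pick_solver("generic", 200_000))  # hybrid
```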
Reproducing these results
All instances are available in the NEROX Dataset Hub under benchmark/solver-comparison-apr2026. Solver configurations are serialized in each result object. To reproduce the FT10 Tabu result:
```python
import nerox

client = nerox.Client()
ft10 = client.datasets.get("orlib/ft10")  # FT10 job shop benchmark

# 5 independent Tabu runs, report median makespan
jobs_list = [
    client.optimize.scheduling(
        jobs=ft10.jobs,
        problem_type="job_shop",
        solver="tabu",
        tabu_tenure=20,
        max_iter=50000,
        seed=i,
    )
    for i in range(5)
]
results = [j.wait() for j in jobs_list]
makespans = [r.objective for r in results]
print(f"Median makespan: {sorted(makespans)[2]}")               # 935
print(f"Gap: {(sorted(makespans)[2] - 930) / 930 * 100:.2f}%")  # 0.54%
```

Discrepancies from published numbers are expected if solver versions differ; pin nerox==0.9.2 for exact reproduction. The benchmark dataset includes the serialized solver configs used to generate these numbers.