NEROX Engineering
April 2, 2026
The question of when QAOA outperforms classical heuristics is central to the near-term quantum advantage debate. Theoretical approximation ratio guarantees for QAOA at depth p are known for specific graph problems (Farhi et al., 2014; Wang et al., 2018), but empirical results on realistic instances under practical time budgets are sparse. This report presents a controlled comparison of four solvers on six standard benchmark instances across four problem classes, with fixed hardware and time limits.
Experimental setup
All runs were executed on a single NVIDIA A100 SXM4 80 GB (CUDA 12.1, driver 525.85.12). Solver configurations:

- GPU Parallel Annealing (GPU-SA): 4,096 chains, 20,000 sweeps per chain, geometric cooling schedule with β auto-calibrated from a 500-step random walk estimating the energy standard deviation.
- QAOA: cuStateVec statevector simulation, depth p = 4, L-BFGS-B parameter optimization with 200 function evaluations, 1,000 measurement shots per evaluation, 10 random restarts.
- Tabu Search: adaptive tenure ∈ [10, 30], 50,000 iterations, problem-specific neighborhood operators (2-opt for TSP, operation-swap for job shop).
- CPU SA: single-chain Metropolis-Hastings with the same total sweep count as GPU-SA (4,096 × 20,000 = 81.9M steps), geometric cooling.
All solvers use default configurations — no instance-specific tuning. Five independent trials per instance; median reported. Gap to optimal computed against known exact solutions or the Held-Karp lower bound (†) for TSP. QAOA applied only where statevector simulation is tractable (≤ 64 qubits).
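The β auto-calibration step described above can be sketched as follows. This is an illustrative reconstruction of the stated procedure (random walk, energy standard deviation, β₀ ≈ 1/σ_E), not the NEROX implementation; the function name and the 1/σ guard are assumptions.

```python
import numpy as np

def calibrate_beta(Q, n_steps=500, seed=0):
    """Estimate an initial inverse temperature for geometric cooling.

    Walks randomly over single bit-flips, records the energy trace, and
    sets beta_0 ~ 1/sigma_E so that typical uphill moves start out mostly
    accepted. Illustrative sketch, not the NEROX implementation.
    """
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.integers(0, 2, size=n).astype(float)
    energies = np.empty(n_steps)
    for t in range(n_steps):
        i = rng.integers(n)
        x[i] = 1.0 - x[i]            # accept every flip: unbiased random walk
        energies[t] = x @ Q @ x      # QUBO energy
    sigma = energies.std()
    return 1.0 / max(sigma, 1e-12)   # guard against flat energy landscapes
```

A 500-step walk is cheap relative to 20,000 sweeps per chain, so the calibration adds negligible overhead.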
Results
| Instance | Optimal | GPU-SA | QAOA p=4 | Tabu | CPU-SA | t (GPU) |
|---|---|---|---|---|---|---|
| eil51 (TSP, 51 cities) | 426 | 427 (0.23%) | n/a | 429 (0.70%) | 431 (1.17%) | 0.4s |
| pr1002 (TSP, 1002 cities) | 259,045† | 261,800 (1.06%) | n/a | n/a | 275,140 (6.21%) | 3.8s |
| G22 (MaxCut, 2000 nodes) | 13,359 | 13,291 (0.51%) | n/a | n/a | 12,980 (2.84%) | 9.1s |
| G01 (MaxCut, 800 nodes) | 11,624 | 11,568 (0.48%) | 11,040 (5.02%) | 11,490 (1.15%) | 11,200 (3.64%) | 1.2s |
| FT10 (JobShop, 10×10) | 930 | 962 (3.44%) | n/a | 935 (0.54%) | 1,041 (11.9%) | 1.1s |
| Portfolio-100 (QAP-like) | — | σ²=0.0031 | σ²=0.0038 | σ²=0.0041 | σ²=0.0089 | 0.2s |
Gap in parentheses vs known optimal or Held-Karp bound (†). n/a = instance exceeds QAOA simulation budget or problem structure incompatible. Portfolio uses portfolio variance σ² (lower is better); no known optimal.
Analysis
GPU-SA dominates on large instances. The advantage compounds with instance size: on eil51, GPU-SA is marginally better than Tabu (0.23% vs 0.70% gap); on pr1002, it is the only solver that produces a competitive result (1.06% gap) within the time budget, while CPU-SA lands 6.21% above the Held-Karp bound in equivalent wall-clock time. The explanation is straightforward: 4,096 independent restarts dramatically reduce the probability that every chain gets trapped in the same basin. At pr1002, the QUBO has ~1M non-zero entries in Q; the A100's memory bandwidth sustains high throughput per chain despite the large matrix.
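The restart argument can be made quantitative under the simplifying assumption that chains are independent and each escapes the dominant basin with some small per-chain probability (the 0.2% figure below is an arbitrary illustration, not a measurement):

```python
def p_any_chain_succeeds(p_single, n_chains):
    """Probability that at least one of n independent chains reaches the
    target basin, given per-chain success probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_chains

# Even a 0.2% per-chain success rate becomes near-certain across 4,096
# chains, while a single CPU chain stays at 0.2%:
print(round(p_any_chain_succeeds(0.002, 4096), 4))  # 0.9997
```

This is why the gap between GPU-SA and single-chain CPU-SA widens with instance size even at identical total sweep counts.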
Tabu Search is competitive on structured instances. On FT10 (job shop, 10 jobs × 10 machines, known optimal 930), Tabu Search at 0.54% gap substantially outperforms GPU-SA at 3.44%. The cause is neighborhood structure: the operation-swap move operator changes the relative ordering of two operations on the same machine, a semantically meaningful move that preserves feasibility while perturbing the makespan in small increments. Random bit-flips in annealing have no such structure: a single flip toggles one assignment variable in the QUBO encoding (e.g. one operation's time slot in a time-indexed formulation), which can invalidate multiple precedence constraints simultaneously and requires large energy penalties to repair. Tabu Search exploits scheduling structure that GPU-SA cannot access through the QUBO encoding.
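The operation-swap neighborhood described above can be sketched as a single move generator. This is an illustrative simplification (production operators typically restrict swaps to critical-path pairs); the data layout and function name are assumptions:

```python
import random

def operation_swap(machine_sequences, rng=None):
    """One tabu neighborhood move: swap two adjacent operations on a
    randomly chosen machine's processing sequence.

    machine_sequences: list of per-machine sequences, each a list of
    (job, op) tuples. Returns a new candidate; input is not mutated.
    Illustrative sketch, not the NEROX operator.
    """
    rng = rng or random.Random()
    seqs = [list(s) for s in machine_sequences]  # copy, don't mutate input
    candidates = [m for m, s in enumerate(seqs) if len(s) >= 2]
    if not candidates:
        return seqs
    m = rng.choice(candidates)
    i = rng.randrange(len(seqs[m]) - 1)
    seqs[m][i], seqs[m][i + 1] = seqs[m][i + 1], seqs[m][i]
    return seqs
```

Every candidate produced this way is a valid machine ordering, so the search never pays the constraint-repair cost that dominates the QUBO bit-flip neighborhood.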
QAOA is bounded by circuit simulation cost. QAOA at p=4 is not directly tractable for G01: the full 800-node MaxCut would require 800 qubits, far beyond statevector limits, so the reported run used a 64-node subgraph extracted by spectral sampling. On this subgraph, QAOA returns a 5.02% gap versus GPU-SA's 0.48% on the full graph. The comparison is not apples-to-apples: QAOA operates on a much easier subproblem. For the full G01 instance, QAOA is inapplicable at any current simulation scale. QAOA's theoretical guarantee for MaxCut at p=1 is a 0.6924 approximation ratio on 3-regular graphs (Farhi et al., 2014); empirically, p=4 on small dense instances approaches a 0.92–0.95 approximation ratio, which is competitive with Goemans-Williamson (0.878) but not with GPU-SA.
Portfolio variance minimization is a dense QUBO (every asset pair has a non-zero covariance entry), which favors GPU-SA — the matrix is fully loaded into HBM2e and throughput is maximal. QAOA shows higher variance across trials (σ² = 0.0038 vs 0.0031), reflecting sensitivity to variational parameter initialization.
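To see why this QUBO is dense, here is a sketch of the standard mean-variance encoding with a cardinality penalty. The function name, penalty form, and default values are illustrative assumptions, not the NEROX encoding:

```python
import numpy as np

def portfolio_qubo(cov, mu, risk_aversion=1.0, budget=10, penalty=10.0):
    """Mean-variance asset selection as a QUBO over x_i in {0, 1}.

    Objective: risk_aversion * x^T cov x - mu^T x
               + penalty * (sum_i x_i - budget)^2   (constant term dropped)

    Every asset pair contributes a covariance entry, so Q is fully dense.
    Illustrative encoding, not the NEROX one.
    """
    n = len(mu)
    Q = risk_aversion * np.asarray(cov, dtype=float).copy()
    Q[np.diag_indices(n)] -= np.asarray(mu, dtype=float)  # reward expected return
    # Expand penalty * (sum x_i - budget)^2: +penalty on every entry of Q,
    # plus an extra -2 * budget * penalty on the diagonal (x_i^2 == x_i).
    Q += penalty
    Q[np.diag_indices(n)] -= 2.0 * budget * penalty
    return Q
```

Because Q has no exploitable sparsity, per-chain throughput is limited by memory bandwidth rather than by irregular access patterns, which is the regime where the A100 is strongest.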
Implications for solver selection
General QUBO, any size
GPU-SA. Default. The only solver with competitive results across all tested instance classes.
Job shop / flow shop scheduling
Tabu Search with problem-specific move operators. GPU-SA degrades significantly due to constraint violation cost in QUBO encoding.
MaxCut / graph problems ≤ 64 nodes
QAOA at p ≥ 4 is worth benchmarking. Beyond 64 nodes, GPU-SA dominates under any practical time budget.
Very large QUBO (> 50,000 variables)
Hybrid Solver. GPU-SA memory is exhausted at ~100K variables on A100 (float16); Hybrid decomposition handles arbitrarily large instances at modest quality cost.
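The selection guidance above can be captured as a small dispatch helper. The thresholds come from this report; the solver labels are illustrative strings, not a real NEROX API:

```python
def pick_solver(problem_type, n_vars):
    """Heuristic solver dispatch mirroring the selection guidance above.
    Thresholds are from this report; names are labels, not a NEROX API."""
    if n_vars > 50_000:
        return "hybrid"      # decomposition; GPU-SA memory-bound past ~100K vars
    if problem_type in ("job_shop", "flow_shop"):
        return "tabu"        # structured move operators beat QUBO bit-flips
    if problem_type == "maxcut" and n_vars <= 64:
        return "qaoa"        # statevector simulation still tractable
    return "gpu_sa"          # default: competitive on every tested class

print(pick_solver("maxcut", 64))        # qaoa
print(pick_solver("job_shop", 100))     # tabu
print(pick_solver("generic", 200_000))  # hybrid
```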
Reproducing these results
All instances are available in the NEROX Dataset Hub under benchmark/solver-comparison-apr2026. Solver configurations are serialized in each result object. To reproduce the FT10 Tabu result:
```python
import nerox

client = nerox.Client()
ft10 = client.datasets.get("orlib/ft10")  # FT10 job shop benchmark

# 5 independent Tabu runs, report median makespan
jobs_list = [
    client.optimize.scheduling(
        jobs=ft10.jobs,
        problem_type="job_shop",
        solver="tabu",
        tabu_tenure=20,
        max_iter=50000,
        seed=i,
    )
    for i in range(5)
]
results = [j.wait() for j in jobs_list]
makespans = [r.objective for r in results]
print(f"Median makespan: {sorted(makespans)[2]}")               # 935
print(f"Gap: {(sorted(makespans)[2] - 930) / 930 * 100:.2f}%")  # 0.54%
```

Discrepancies from published numbers are expected if solver versions differ; pin nerox==0.9.2 for exact reproduction. The benchmark dataset includes the serialized solver configs used to generate these numbers.