How it works
Simulated annealing explores the solution space by making random moves (bit flips in QUBO) and accepting worse solutions with probability exp(-ΔE/T). Temperature T is annealed from high (random walk) to low (local optimization). On a GPU, NEROX runs thousands of these chains simultaneously — each starting from a different random initial state — and returns the best solution found across all chains.
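The chain described above can be sketched in a few lines of NumPy. This is an illustrative single CPU chain, not NEROX's actual GPU kernel; the function name and the geometric schedule on inverse temperature are our choices:

```python
import numpy as np

def anneal_qubo(Q, n_sweeps=2000, beta_start=0.1, beta_end=5.0, seed=None):
    """Minimize x^T Q x over x in {0,1}^n with one simulated-annealing chain."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.integers(0, 2, size=n)          # random initial state
    for beta in np.geomspace(beta_start, beta_end, n_sweeps):
        for i in rng.permutation(n):        # one sweep = n attempted flips
            d = 1 - 2 * x[i]                # +1 for a 0->1 flip, -1 for 1->0
            # Energy change of flipping bit i; reads row i and column i of Q.
            delta = d * (Q[i, i] + (Q[i, :] + Q[:, i]) @ x - 2 * Q[i, i] * x[i])
            # Metropolis rule: accept downhill moves always,
            # uphill moves with probability exp(-beta * delta).
            if delta <= 0 or rng.random() < np.exp(-beta * delta):
                x[i] ^= 1
    return x, float(x @ Q @ x)
```

Running many such chains from different seeds and keeping the best result is exactly the multi-chain strategy the GPU solver parallelizes.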
The GPU advantage is not just parallelism but memory bandwidth. Evaluating the energy change of a single bit flip requires reading O(n) entries of the Q matrix. An NVIDIA A100 with HBM2e provides roughly 2 TB/s of memory bandwidth, enabling millions of energy evaluations per second per chain.
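As a back-of-envelope check on why bandwidth matters (our arithmetic, ignoring caching and the fact that many chains can share reads of Q):

```python
n = 1_000                       # problem size (variables)
bandwidth = 2e12                # A100 HBM2e, ~2 TB/s
bytes_per_flip = 2 * n * 4      # row i + column i of Q in float32
ceiling = bandwidth / bytes_per_flip
print(f"{ceiling:.2e} flips/s") # bandwidth-bound ceiling per GPU
```

At n = 1,000 this works out to a ceiling of about 2.5e8 flip evaluations per second if every evaluation streamed from HBM, which is why keeping Q resident in cache and sharing it across chains is what makes multi-chain throughput scale.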
Usage
import nerox
client = nerox.Client()
job = client.optimize.qubo(
Q=Q_matrix,
solver="gpu",
n_runs=1024, # number of parallel chains (default 256)
n_sweeps=20000, # steps per chain (default 10000)
beta_start=0.1, # initial inverse temperature
beta_end=5.0, # final inverse temperature
seed=42, # reproducibility
)
result = job.wait()
print(result.solution, result.objective)

Parameter guide
Parameter      Default   Description
n_runs         256       Independent chains. More runs → better solutions, linear cost increase.
n_sweeps       10,000    Steps per chain. Increase for rugged landscapes with many local minima.
beta_start     auto      Starting inverse temperature. Auto-set from Q matrix scale.
beta_end       auto      Final inverse temperature. Higher = more exploitation at the end.
time_limit_s   None      Wall-clock limit. Solver returns best found when limit is hit.

Hardware
All GPU Annealing jobs run on NVIDIA A100 80 GB or H100 SXM5 instances provisioned on-demand. GPU-seconds are billed from job start to completion. Typical A100 throughput: ~50M bit-flip evaluations per second per chain at n=1,000 variables.
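Reading one sweep as n single-bit updates (our assumption; the section above does not define it), the quoted throughput gives a rough per-chain runtime for the Usage example:

```python
n = 1_000             # variables
n_sweeps = 20_000     # as in the Usage example
evals_per_sec = 50e6  # quoted A100 throughput per chain at n=1,000
flip_evals = n_sweeps * n            # 2e7 evaluations per chain
seconds = flip_evals / evals_per_sec
print(seconds)        # ~0.4 s per chain; chains run concurrently on the GPU
```

Since all 1,024 chains run concurrently, billed GPU time for such a job is dominated by this per-chain runtime plus provisioning overhead, not by the chain count.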
When to use
GPU Annealing is the recommended default solver for all QUBO problems from 10 to 50,000 variables. Switch to Hybrid Solver above 50,000 variables, or to Tabu Search when your problem has strong domain structure that a neighborhood heuristic exploits better than random bit flips.
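The guidance above can be condensed into a small dispatch helper. This function is ours, not part of the NEROX client; the thresholds come from the text:

```python
def pick_solver(n_vars: int, strong_domain_structure: bool = False) -> str:
    """Map the rules of thumb above to a solver name (illustrative only)."""
    if n_vars > 50_000:
        return "hybrid"   # beyond GPU Annealing's recommended range
    if strong_domain_structure:
        return "tabu"     # neighborhood heuristics beat random bit flips
    return "gpu"          # recommended default for 10-50,000 variables
```

For example, a 1,000-variable portfolio QUBO with no special neighborhood structure would route to the GPU Annealing default.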
