Technical Overview · March 15, 2026 · 8 min read

NEROX: GPU-Native QUBO Optimization
at Production Scale


NEROX Team


NEROX is a solver platform for Quadratic Unconstrained Binary Optimization (QUBO) and Ising-model problems. It exposes a REST API and Python SDK that accept problem instances in standard formats and dispatch them to a stack of GPU-accelerated solvers: parallel simulated annealing, Tabu Search, a hybrid decomposition solver, QAOA via statevector simulation, and VQE. No reformulation layer, no abstraction tax — the solver receives the Q matrix directly and returns a binary assignment vector.
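To make the problem statement concrete, here is a toy illustration (not NEROX code) of what a QUBO instance is: a symmetric matrix Q, with the objective of minimizing x^T Q x over binary vectors x. Brute-force enumeration works only for tiny n, which is exactly why GPU heuristics are needed at production scale:

```python
import numpy as np

# A tiny symmetric QUBO instance (values chosen for illustration only).
Q = np.array([[-3.0,  2.0,  0.0],
              [ 2.0, -2.0,  1.0],
              [ 0.0,  1.0, -1.0]])

def energy(Q, x):
    """QUBO objective: x^T Q x for a binary assignment vector x."""
    return float(x @ Q @ x)

# Brute force over all 2^n assignments -- viable only for tiny n.
n = Q.shape[0]
best = min((tuple((i >> k) & 1 for k in range(n)) for i in range(2 ** n)),
           key=lambda bits: energy(Q, np.array(bits)))
```

For this instance the minimizer is x = (1, 0, 1) with energy −4; a solver like NEROX returns exactly such a binary assignment vector, just at scales where enumeration is impossible.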

Parallel GPU Annealing

The primary solver runs Metropolis-Hastings simulated annealing across 4,096 independent Markov chains simultaneously on a single NVIDIA A100. Each chain maintains its own binary state vector and temperature schedule. At each sweep, all chains perform n single-variable flip proposals in parallel using CUDA warp-level parallelism, accepting or rejecting via the Boltzmann criterion. Chains do not communicate — the ensemble is embarrassingly parallel.

The energy evaluation bottleneck is a dense matrix-vector product: for symmetric Q, ΔE for a flip at position k is ΔE_k = (1 − 2·x_k)·2·(Q·x)_k + Q_kk, requiring a single row of Q per proposal. At n = 5,000 variables, an A100's 2 TB/s HBM2e bandwidth supports ~40M energy evaluations per second per chain, or ~160B evaluations per second across 4,096 chains. Cooling schedules are calibrated per problem using an initial random walk phase to estimate the energy distribution width, then geometric annealing from β_start to β_end over n_sweeps steps.
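The scheme above can be sketched as a CPU reference implementation in NumPy: single-variable flip proposals evaluated across all chains at once using the ΔE_k identity, with Metropolis acceptance under a geometric β schedule. This is illustrative only; the production kernel runs 4,096 chains in CUDA, and the parameter defaults here are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def anneal(Q, n_chains=64, n_sweeps=200, beta_start=0.1, beta_end=5.0):
    """CPU reference sketch of multi-chain Metropolis annealing for
    symmetric QUBO matrix Q. Chains never communicate."""
    n = Q.shape[0]
    X = rng.integers(0, 2, size=(n_chains, n)).astype(np.float64)
    betas = np.geomspace(beta_start, beta_end, n_sweeps)  # geometric schedule
    diag = np.diag(Q)
    for beta in betas:
        for k in range(n):  # one sweep = n single-variable proposals
            # dE for flipping bit k in every chain: (1-2x_k)*2*(Qx)_k + Q_kk
            qx_k = X @ Q[k]
            dE = (1.0 - 2.0 * X[:, k]) * 2.0 * qx_k + diag[k]
            # Boltzmann criterion; downhill moves (dE <= 0) always accepted.
            accept = rng.random(n_chains) < np.exp(-beta * np.maximum(dE, 0.0))
            X[accept, k] = 1.0 - X[accept, k]
    energies = np.einsum("ci,ij,cj->c", X, Q, X)
    best = int(np.argmin(energies))
    return X[best], float(energies[best])
```

Each chain touches one row of Q per proposal, so the inner loop is memory-bandwidth bound, matching the bandwidth arithmetic above.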

On TSPLIB instance pr1002 (1,002 cities), GPU Annealing returns a tour within 1.1% of the Held-Karp lower bound in 3.8 seconds on a single A100. Optimal-gap statistics over 10 independent runs: μ = 1.08%, σ = 0.09%.

Solver stack

All five solvers are available on the same API endpoint via the solver parameter:

  • solver="gpu" · GPU Parallel Annealing

    4,096 chains on A100/H100. Default for all QUBO problems ≤ 50,000 variables. Median gap < 2% on TSPLIB, QPLIB, and OR-Library scheduling benchmarks.

  • solver="tabu" · Tabu Search

    Adaptive-tenure memory-guided local search. Outperforms annealing on structured scheduling problems where neighborhood moves have strong domain semantics. Uses problem-type-specific move operators when available.

  • solver="hybrid" · Hybrid Decomposition

    Variable interaction graph clustering followed by parallel GPU subproblem solving and solution stitching via large neighborhood search. Handles instances up to ~1M variables via multi-pass decomposition.

  • solver="qaoa" · QAOA

    Variational quantum circuit simulation via cuStateVec. Supports up to 64 qubits (tensor network backend) at circuit depth p ≤ 12. Variational parameters optimized with L-BFGS-B.

  • solver="vqe" · VQE

    Variational Quantum Eigensolver for Hamiltonian ground-state problems. Supports UCCSD, hardware-efficient, and HVA ansätze up to ~50 qubits. Intended for quantum chemistry and physics Hamiltonians.
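One way to encode the guidance above in client code is a small dispatch helper. This is an illustrative heuristic, not an official NEROX policy; only the solver names and the size/structure thresholds come from the list above:

```python
def pick_solver(n_vars: int, structured_scheduling: bool = False) -> str:
    """Map problem size and structure to a solver name from the stack.
    Thresholds follow the solver descriptions: gpu is the default up to
    50,000 variables; hybrid handles up to ~1M via decomposition; tabu
    wins on structured scheduling with strong move semantics."""
    if structured_scheduling:
        return "tabu"
    if n_vars <= 50_000:
        return "gpu"
    return "hybrid"
```

A caller would pass the result as the `solver` parameter on the job submission.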

Hybrid decomposition

For problems exceeding single-GPU memory (80 GB on A100 supports dense Q matrices up to ~100,000 variables in float16), the Hybrid Solver partitions the variable interaction graph into overlapping subgraphs using a spectral clustering heuristic. Each subgraph is solved independently on a GPU worker; boundary variable assignments are fixed and propagated between subgraphs on successive passes. This is equivalent to a parallel large neighborhood search where each neighborhood corresponds to a variable cluster.

Decomposition introduces a boundary error proportional to the inter-cluster coupling strength divided by the intra-cluster coupling density. For sparse real-world instances (road networks, supply chain graphs, protein contact maps), this ratio is small and the quality loss from decomposition is typically under 1% relative to direct solve.
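A rough proxy for that ratio can be computed directly from Q and a candidate partition. The sketch below measures the total off-diagonal coupling mass crossing cluster boundaries against the mass kept inside clusters; the exact metric NEROX uses internally is not specified here, so treat this as an assumption-laden approximation:

```python
import numpy as np

def coupling_ratio(Q, labels):
    """Estimate inter-cluster vs intra-cluster coupling for a partition.
    labels[i] is the cluster id of variable i. Small values suggest the
    decomposition will lose little solution quality."""
    labels = np.asarray(labels)
    A = np.abs(np.triu(Q, k=1))                 # each coupling counted once
    cross = labels[:, None] != labels[None, :]  # True for boundary pairs
    inter = A[cross].sum()
    intra = A[~cross].sum()
    return inter / max(intra, 1e-12)
```

On a sparse road-network-style instance this ratio is small, consistent with the sub-1% quality loss claimed above; on a dense, fully coupled Q it approaches or exceeds 1 and decomposition is a poor fit.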

API and deployment

The cloud API is available at https://driftrail.com/nerox/api. Jobs are submitted via POST to /v1/jobs with the Q matrix serialized as a JSON array or scipy sparse COO format. Results are retrieved by polling /v1/jobs/{job_id} or via WebSocket streaming on /v1/jobs/{job_id}/stream.
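A submission might look like the sketch below. The endpoint path comes from the text, but the JSON field names (`solver`, `format`, `rows`, `cols`, `values`, `job_id`) are assumptions about the schema, not documented API fields:

```python
import json
import numpy as np

API = "https://driftrail.com/nerox/api/v1/jobs"

def to_coo_payload(Q, solver="gpu"):
    """Serialize a dense Q matrix into a sparse COO-style job payload.
    Field names are illustrative, not the documented request schema."""
    rows, cols = np.nonzero(Q)
    return {
        "solver": solver,
        "format": "coo",
        "shape": list(Q.shape),
        "rows": rows.tolist(),
        "cols": cols.tolist(),
        "values": Q[rows, cols].tolist(),
    }

Q = np.array([[-1.0, 2.0], [2.0, -1.0]])
payload = to_coo_payload(Q)
body = json.dumps(payload)
# Submission and polling would then be, e.g. (field names assumed):
#   resp = requests.post(API, json=payload)
#   job_id = resp.json()["job_id"]
#   result = requests.get(f"{API}/{job_id}").json()
```

Because the on-premise container exposes the same REST API, only the base URL would change between environments.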

On-premise deployment ships as a Docker image (registry.driftrail.com/nerox-solver:latest) requiring NVIDIA Container Toolkit and CUDA ≥ 12.1. The solver container exposes the same REST API as the cloud service, enabling identical client code across environments. A Kubernetes StatefulSet configuration and Helm chart are available in the deployment documentation.

Problem types and input formats

The API accepts raw QUBO matrices (dense numpy or scipy sparse COO/CSR) and structured problem inputs for named problem types — TSP (distance matrix), CVRP (distance matrix + demands + capacity), portfolio (covariance matrix + return vector), job shop scheduling (operation sequence arrays), MaxCut (adjacency matrix), and bin packing (weight arrays). Structured inputs are automatically converted to QUBO internally using published penalty formulations; the raw Q matrix is available in the response for inspection.
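As a concrete instance of such a conversion, the standard MaxCut-to-QUBO mapping (a published formulation, sketched here independently of NEROX's internal converter) puts the negated weighted degree on the diagonal and the edge weights off-diagonal, so that minimizing x^T Q x maximizes the cut:

```python
import numpy as np

def maxcut_to_qubo(W):
    """Standard MaxCut -> QUBO mapping for adjacency matrix W:
    Q_ij = w_ij (i != j), Q_ii = -sum_j w_ij. Then x^T Q x = -cut(x)."""
    Q = W.copy().astype(float)
    np.fill_diagonal(Q, -W.sum(axis=1))
    return Q

# Triangle with unit weights: any 2-vs-1 split cuts 2 of the 3 edges.
W = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
Q = maxcut_to_qubo(W)
```

For this triangle the minimum of x^T Q x is −2, matching the maximum cut of 2, which is the kind of raw Q matrix the API would return for inspection.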

Benchmark instances from TSPLIB, QPLIB, the OR-Library, and the Biq Mac Library are available via client.datasets.get("tsplib/eil51") — no manual download required. Known optimal values are attached to each dataset and used to auto-compute result.gap_to_best.

Quickstart

Install the SDK and submit your first QUBO in under 5 minutes.
