NEROX Quantum
April 21, 2026
A dense QUBO with n variables has n(n−1)/2 coupling terms. Compiling the QAOA phase-separator e^(−iγ·H_C) to a 1D qubit bus with the naive approach — emit one CX·Rz·CX sequence per edge, route with SWAPs — produces circuit depth that scales as O(n²). On a simulator this dominates wall-clock time; on real hardware it dominates decoherence. Starting in v0.9.3, the NEROX QAOA solver ships an alternative: the Parity Twine Chain (PTC) ansatz, which compiles the same phase-separator to O(n) two-qubit depth with no SWAP overhead.
The fused SWAP·ZZ brick
PTC replaces the standard SWAP-network pattern — 3 CNOTs for the swap plus 2 CNOTs wrapping an Rz for the ZZ rotation, totalling 5 CNOTs — with a single fused 3-CNOT block that performs both operations at once:
```
CX(a, b)           # entangle parity onto b
Rz(2·γ·J_ab, b)    # phase rotation
CX(b, a)           # propagate parity back
CX(a, b)           # complete the swap

# Equivalent unitary: SWAP(a,b) · exp(-i·γ·J_ab·Z_a·Z_b)
```
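The identity is easy to check numerically. The sketch below is standalone NumPy (not part of the NEROX codebase): it multiplies out the four-gate sequence in circuit order and compares the result against SWAP · exp(−i·γ·J·Z⊗Z).

```python
import numpy as np

# Two-qubit basis |ab>, index = 2*a + b.
CX_ab = np.array([[1, 0, 0, 0],  # CX with control a, target b
                  [0, 1, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=complex)
CX_ba = np.array([[1, 0, 0, 0],  # CX with control b, target a
                  [0, 0, 0, 1],
                  [0, 0, 1, 0],
                  [0, 1, 0, 0]], dtype=complex)
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

def rz_on_b(theta):
    """Rz(theta) on qubit b: I tensor diag(e^{-i theta/2}, e^{i theta/2})."""
    return np.kron(np.eye(2), np.diag([np.exp(-1j * theta / 2),
                                       np.exp(1j * theta / 2)]))

def ptc_brick(gamma, J):
    """The fused brick in circuit order (matrix product reads right to left)."""
    return CX_ab @ CX_ba @ rz_on_b(2 * gamma * J) @ CX_ab

def swap_zz(gamma, J):
    """Target unitary SWAP(a,b) . exp(-i*gamma*J*Z_a Z_b); ZZ is diagonal."""
    exp_zz = np.diag(np.exp(-1j * gamma * J * np.array([1, -1, -1, 1])))
    return SWAP @ exp_zz

assert np.allclose(ptc_brick(0.37, 1.3), swap_zz(0.37, 1.3))  # exact, no residual global phase
```

Note that with J = 0 the brick collapses to the bare 3-CNOT SWAP, which is the degenerate case of the identity.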
Inside a QAOA layer the bricks are tiled in a brick-wall schedule: alternating rounds act on the even-offset and odd-offset adjacent pairs of the bus, so the disjoint bricks within a round run in parallel at two-qubit depth 3. After n rounds every logical pair has been adjacent exactly once; the first and last rounds replace the brick with a plain Rzz at depth 2 (no swap is needed at the boundaries). The resulting two-qubit depth is 2 + 3·(n−2) + 2 = 3n − 2 per QAOA layer for a fully-connected problem.
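A minimal re-creation of such a schedule (an illustration of the round structure, not the shipped brick_wall_pairs implementation) is an odd-even transposition pass that tracks which logical qubit sits at each physical site:

```python
def brick_wall_rounds(n):
    """Odd-even transposition schedule on a 1D bus of n sites.

    pos[i] is the logical qubit currently at physical site i. Each round
    pairs adjacent sites at alternating offsets (0, 1, 0, 1, ...) and swaps
    them; after n rounds every logical pair has been adjacent exactly once
    and the bus order is reversed.
    """
    pos = list(range(n))
    rounds = []
    for r in range(n):
        pairs = []
        for i in range(r % 2, n - 1, 2):
            pairs.append((pos[i], pos[i + 1]))
            pos[i], pos[i + 1] = pos[i + 1], pos[i]
        rounds.append(pairs)
    return rounds

# Every unordered logical pair appears exactly once across the n rounds:
seen = [frozenset(p) for rnd in brick_wall_rounds(8) for p in rnd]
assert len(seen) == len(set(seen)) == 8 * 7 // 2
```

The first and last rounds together contain n−1 pairs, which matches the count of boundary bricks that degrade to a plain Rzz.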
Resource comparison
The table below compares the current default NEROX phase-separator (ansatz="linear") against the new PTC path (ansatz="parity_twine") on a fully-connected QUBO. Numbers are from scripts/bench_parity_twine.py, single QAOA layer, dense coupling.
| n | Naive CX | Naive depth | PTC CX | PTC depth | Depth ratio |
|---|---|---|---|---|---|
| 4 | 12 | 12 | 15 | 10 | 83.3% |
| 8 | 56 | 56 | 77 | 22 | 39.3% |
| 16 | 240 | 240 | 345 | 46 | 19.2% |
| 32 | 992 | 992 | 1,457 | 94 | 9.5% |
| 64 | 4,032 | 4,032 | 5,985 | 190 | 4.7% |
| 120 | 14,280 | 14,280 | 21,301 | 358 | 2.5% |
Depth ratio = PTC depth ÷ naive depth (serialized). At n=120, PTC reaches 2.5% of the baseline depth.
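The rows follow from closed-form counts. The function below is a reconstruction from the brick structure described above, not the benchmark script itself: n(n−1)/2 bricks per layer, of which n−1 boundary bricks are plain 2-CX Rzz and the rest are 3-CX fused bricks.

```python
def ptc_resources(n):
    """Closed-form resource counts for one dense QAOA layer (reconstruction)."""
    pairs = n * (n - 1) // 2      # one brick per logical pair
    boundary = n - 1              # plain Rzz bricks: 2 CX each
    inner = pairs - boundary      # fused SWAP.ZZ bricks: 3 CX each
    return {
        "naive_cx": 2 * pairs,    # CX.Rz.CX per edge, serialized on the bus
        "naive_depth": 2 * pairs,
        "ptc_cx": 3 * inner + 2 * boundary,
        "ptc_depth": 3 * n - 2,   # 2 + 3*(n-2) + 2
    }

# Reproduces the n=120 row of the table above:
assert ptc_resources(120) == {"naive_cx": 14280, "naive_depth": 14280,
                              "ptc_cx": 21301, "ptc_depth": 358}
```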
Two observations worth flagging. First, PTC emits more total CX gates (~1.49× naive at n=120) — the three CX bricks accumulate. The depth reduction comes from parallelism: disjoint bricks across the bus execute simultaneously, whereas the naive schedule serializes one CX·Rz·CX pair at a time on a 1D bus. Second, the naive NEROX kernel assumes free all-to-all connectivity on the CUDA-Q simulator and skips routing — so the CX count above is a lower bound for hardware execution. On a real topology the naive schedule would need SWAP insertion that blows up to O(n²) extra CX; PTC's constant-factor overhead is the correct tradeoff for any hardware-aware deployment.
When to turn it on
Dense QUBO, n ≥ 10
Enable PTC. Portfolio, Community Detection, MaxCut on dense graphs, and TSP (density 100%) see the full depth benefit. The solver now emits a warning suggesting the ansatz when density > 0.5.
Sparse QUBO (density < 10%)
Stay on the linear ansatz. The brick-wall schedule still brings every logical pair adjacent once, so most bricks do no useful work when J_ij = 0. Sparse graphs benefit more from the naive edge enumeration, which only emits gates for nonzero couplings.
Shallow QAOA (p ≤ 2)
Marginal gains. Depth is already small; the linear-O(n) reduction matters most at deeper p or larger n.
Hardware backends (future)
Always enable PTC. The 1D bus scheduling is directly compatible with heavy-hex and linear topologies — no SWAP network needed.
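The guidance above can be collapsed into a small heuristic. The helper below is illustrative only: choose_ansatz is not part of the NEROX API, and the exact thresholds are assumptions taken from the rules above.

```python
import numpy as np

def choose_ansatz(Q: np.ndarray) -> str:
    """Illustrative heuristic (not the NEROX API): pick an ansatz from the
    QUBO coupling matrix. Dense couplings with n >= 10 favour parity_twine."""
    n = Q.shape[0]
    off_diag = Q[~np.eye(n, dtype=bool)]          # couplings only, skip linear terms
    density = np.count_nonzero(off_diag) / max(off_diag.size, 1)
    if n >= 10 and density > 0.5:
        return "parity_twine"
    return "linear"

assert choose_ansatz(np.ones((24, 24))) == "parity_twine"  # dense portfolio-style QUBO
assert choose_ansatz(np.eye(12)) == "linear"               # diagonal-only, no couplings
```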
Using the new ansatz
The option is exposed as a single keyword on the existing QAOA solver — no API break, no new module to import.
```python
from packages.solvers.qaoa import QAOASolver

# Portfolio QUBO: dense covariance matrix, n=24 assets
solver = QAOASolver(
    n_layers=4,
    ansatz="parity_twine",  # ← new in v0.9.3
    target="nvidia",
)
result = solver.solve(qubo)

print(result.metadata["ansatz"])  # "parity_twine"
print(result.energy, result.runtime)
```

The default remains ansatz="linear" — existing jobs and saved experiments run unchanged. Direct-to-API users set solver_config: { ansatz: "parity_twine" } in the job payload.
Implementation notes
The schedule generator lives in packages/solvers/parity_twine.py — pure Python, zero dependencies, importable without CUDA-Q. Three entry points: brick_wall_pairs(n) returns the round-robin pair schedule, ptc_gate_list() emits the flat gate sequence for inspection or alternative backends, and count_ptc_resources(n) returns resource estimates without building the circuit. The CUDA-Q kernel build_qaoa_kernel_ptc() in gpu/cudaq_bridge.py bakes the schedule at build time and parameterises only γ and β at run time.
The construction is ported from Monbroussou et al., "Efficient Circuit Transpilation for QAOA via Parity Twine Chains" (arXiv:2505.17944). The same family of parity-propagation techniques was used by ParityQC and IBM for the 52-qubit Quantum Fourier Transform benchmark on Heron r3 (April 2026).
Validation
We verify two invariants: every logical pair (i, j) is brought adjacent exactly once across the n rounds — scripts/bench_parity_twine.py asserts this for n up to 120 — and the gate list decomposes to exactly 3 CNOTs + 1 Rz per inner brick, 2 CNOTs + 1 Rz per boundary Rzz, with the correct Rz angle. Solution quality is unchanged from the naive ansatz on small test cases (n=4, 8, 16) within COBYLA's parameter optimisation variance; the full end-to-end benchmark on dense MaxCut and portfolio instances will ship with v0.9.4.
What's next
Three near-term follow-ons. First, UnifiedSolver will auto-select PTC for dense problems — today the ansatz is a manual toggle. Second, a PTC variant for sparse graphs that skips bricks with zero coupling is straightforward and lands in v0.9.4. Third, the scheduling logic is decoupled from the CUDA-Q backend — it lifts cleanly to cuQuantum and eventually to hardware-facing targets when NEROX wires real QPU execution.