NEROX Quantum
April 21, 2026
A dense QUBO with n variables has n(n−1)/2 coupling terms. Compiling the QAOA phase-separator e^(−iγ·H_C) to a 1D qubit bus with the naive approach — emit one CX·Rz·CX sequence per edge, route with SWAPs — produces circuit depth that scales as O(n²). On a simulator this dominates wall-clock time; on real hardware it dominates decoherence. Starting in v0.9.3, the NEROX QAOA solver ships an alternative: the Parity Twine Chain (PTC) ansatz, which compiles the same phase-separator to O(n) two-qubit depth with no SWAP overhead.
The fused SWAP·ZZ brick
PTC replaces the standard SWAP-network pattern — 3 CNOTs for the swap plus 2 CNOTs wrapping an Rz for the ZZ rotation, totalling 5 CNOTs — with a single fused 3-CNOT block that performs both operations at once:
```
CX(a, b)           # entangle parity onto b
Rz(2·γ·J_ab, b)    # phase rotation
CX(b, a)           # propagate parity back
CX(a, b)           # complete the swap

# Equivalent unitary: SWAP(a,b) · exp(-i·γ·J_ab·Z_a·Z_b)
```
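The identity is easy to check numerically. The sketch below is standalone NumPy (not part of the NEROX codebase): it multiplies out the four-gate sequence in circuit order and compares the result against SWAP · exp(−i·γ·J·Z⊗Z).

```python
import numpy as np

# Two-qubit basis |ab>, index = 2*a + b.
CX_ab = np.array([[1, 0, 0, 0],  # CX with control a, target b
                  [0, 1, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=complex)
CX_ba = np.array([[1, 0, 0, 0],  # CX with control b, target a
                  [0, 0, 0, 1],
                  [0, 0, 1, 0],
                  [0, 1, 0, 0]], dtype=complex)
SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

def rz_on_b(theta):
    """Rz(theta) on qubit b: I tensor diag(e^{-i theta/2}, e^{i theta/2})."""
    return np.kron(np.eye(2), np.diag([np.exp(-1j * theta / 2),
                                       np.exp(1j * theta / 2)]))

def ptc_brick(gamma, J):
    """The fused brick in circuit order (matrix product reads right to left)."""
    return CX_ab @ CX_ba @ rz_on_b(2 * gamma * J) @ CX_ab

def swap_zz(gamma, J):
    """Target unitary SWAP(a,b) . exp(-i*gamma*J*Z_a Z_b); ZZ is diagonal."""
    exp_zz = np.diag(np.exp(-1j * gamma * J * np.array([1, -1, -1, 1])))
    return SWAP @ exp_zz

assert np.allclose(ptc_brick(0.37, 1.3), swap_zz(0.37, 1.3))  # exact, no residual global phase
```

Note that with J = 0 the brick collapses to the bare 3-CNOT SWAP, which is the degenerate case of the identity.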
Inside a QAOA layer the bricks are tiled in a brick-wall schedule: alternating rounds act on the even-offset and odd-offset adjacent pairs of the bus, so the disjoint bricks within a round run in parallel at two-qubit depth 3. After n rounds every logical pair has been adjacent exactly once; the first and last rounds replace the brick with a plain Rzz at depth 2 (no swap is needed at the boundaries). The resulting two-qubit depth is 2 + 3·(n−2) + 2 = 3n − 2 per QAOA layer for a fully-connected problem.
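A minimal re-creation of such a schedule (an illustration of the round structure, not the shipped brick_wall_pairs implementation) is an odd-even transposition pass that tracks which logical qubit sits at each physical site:

```python
def brick_wall_rounds(n):
    """Odd-even transposition schedule on a 1D bus of n sites.

    pos[i] is the logical qubit currently at physical site i. Each round
    pairs adjacent sites at alternating offsets (0, 1, 0, 1, ...) and swaps
    them; after n rounds every logical pair has been adjacent exactly once
    and the bus order is reversed.
    """
    pos = list(range(n))
    rounds = []
    for r in range(n):
        pairs = []
        for i in range(r % 2, n - 1, 2):
            pairs.append((pos[i], pos[i + 1]))
            pos[i], pos[i + 1] = pos[i + 1], pos[i]
        rounds.append(pairs)
    return rounds

# Every unordered logical pair appears exactly once across the n rounds:
seen = [frozenset(p) for rnd in brick_wall_rounds(8) for p in rnd]
assert len(seen) == len(set(seen)) == 8 * 7 // 2
```

The first and last rounds together contain n−1 pairs, which matches the count of boundary bricks that degrade to a plain Rzz.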
Resource comparison
The table below compares the current default NEROX phase-separator (ansatz="linear") against the new PTC path (ansatz="parity_twine") on a fully-connected QUBO. Numbers are from scripts/bench_parity_twine.py, single QAOA layer, dense coupling.
| n | Naive CX | Naive depth | PTC CX | PTC depth | Depth ratio |
|---|---|---|---|---|---|
| 4 | 12 | 12 | 15 | 10 | 83.3% |
| 8 | 56 | 56 | 77 | 22 | 39.3% |
| 16 | 240 | 240 | 345 | 46 | 19.2% |
| 32 | 992 | 992 | 1,457 | 94 | 9.5% |
| 64 | 4,032 | 4,032 | 5,985 | 190 | 4.7% |
| 120 | 14,280 | 14,280 | 21,301 | 358 | 2.5% |
Depth ratio = PTC depth ÷ naive depth (serialized). At n=120, PTC reaches 2.5% of the baseline depth.
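The rows follow from closed-form counts. The function below is a reconstruction from the brick structure described above, not the benchmark script itself: n(n−1)/2 bricks per layer, of which n−1 boundary bricks are plain 2-CX Rzz and the rest are 3-CX fused bricks.

```python
def ptc_resources(n):
    """Closed-form resource counts for one dense QAOA layer (reconstruction)."""
    pairs = n * (n - 1) // 2      # one brick per logical pair
    boundary = n - 1              # plain Rzz bricks: 2 CX each
    inner = pairs - boundary      # fused SWAP.ZZ bricks: 3 CX each
    return {
        "naive_cx": 2 * pairs,    # CX.Rz.CX per edge, serialized on the bus
        "naive_depth": 2 * pairs,
        "ptc_cx": 3 * inner + 2 * boundary,
        "ptc_depth": 3 * n - 2,   # 2 + 3*(n-2) + 2
    }

# Reproduces the n=120 row of the table above:
assert ptc_resources(120) == {"naive_cx": 14280, "naive_depth": 14280,
                              "ptc_cx": 21301, "ptc_depth": 358}
```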
Two observations worth flagging. First, PTC emits more total CX gates (~1.49× naive at n=120) — the three CX bricks accumulate. The depth reduction comes from parallelism: disjoint bricks across the bus execute simultaneously, whereas the naive schedule serializes one CX·Rz·CX pair at a time on a 1D bus. Second, the naive NEROX kernel assumes free all-to-all connectivity on the CUDA-Q simulator and skips routing — so the CX count above is a lower bound for hardware execution. On a real topology the naive schedule would need SWAP insertion that blows up to O(n²) extra CX; PTC's constant-factor overhead is the correct tradeoff for any hardware-aware deployment.
When to turn it on
Dense QUBO, n ≥ 10
Enable PTC. Portfolio, Community Detection, MaxCut on dense graphs, and TSP (density 100%) see the full depth benefit. The solver now emits a warning suggesting the ansatz when density > 0.5.
Sparse QUBO (density < 10%)
Stay on the linear ansatz. The brick-wall schedule still brings every logical pair adjacent once, so most bricks do no useful work when J_ij = 0. Sparse graphs benefit more from the naive edge enumeration, which only emits gates for nonzero couplings.
Shallow QAOA (p ≤ 2)
Marginal gains. Depth is already small; the linear-O(n) reduction matters most at deeper p or larger n.
Hardware backends (future)
Always enable PTC. The 1D bus scheduling is directly compatible with heavy-hex and linear topologies — no SWAP network needed.
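The guidance above can be collapsed into a small heuristic. The helper below is illustrative only: choose_ansatz is not part of the NEROX API, and the exact thresholds are assumptions taken from the rules above.

```python
import numpy as np

def choose_ansatz(Q: np.ndarray) -> str:
    """Illustrative heuristic (not the NEROX API): pick an ansatz from the
    QUBO coupling matrix. Dense couplings with n >= 10 favour parity_twine."""
    n = Q.shape[0]
    off_diag = Q[~np.eye(n, dtype=bool)]          # couplings only, skip linear terms
    density = np.count_nonzero(off_diag) / max(off_diag.size, 1)
    if n >= 10 and density > 0.5:
        return "parity_twine"
    return "linear"

assert choose_ansatz(np.ones((24, 24))) == "parity_twine"  # dense portfolio-style QUBO
assert choose_ansatz(np.eye(12)) == "linear"               # diagonal-only, no couplings
```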
Using the new ansatz
The option is exposed as a single keyword on the existing QAOA solver — no API break, no new module to import.
```python
from packages.solvers.qaoa import QAOASolver

# Portfolio QUBO: dense covariance matrix, n=24 assets
solver = QAOASolver(
    n_layers=4,
    ansatz="parity_twine",  # ← new in v0.9.3
    target="nvidia",
)
result = solver.solve(qubo)

print(result.metadata["ansatz"])  # "parity_twine"
print(result.energy, result.runtime)
```

The default remains ansatz="linear" — existing jobs and saved experiments run unchanged. Direct-to-API users set solver_config: { ansatz: "parity_twine" } in the job payload.
Implementation notes
The schedule generator lives in packages/solvers/parity_twine.py — pure Python, zero dependencies, importable without CUDA-Q. Three entry points: brick_wall_pairs(n) returns the round-robin pair schedule, ptc_gate_list() emits the flat gate sequence for inspection or alternative backends, and count_ptc_resources(n) returns resource estimates without building the circuit. The CUDA-Q kernel build_qaoa_kernel_ptc() in gpu/cudaq_bridge.py bakes the schedule at build time and parameterises only γ and β at run time.
The construction is ported from Monbroussou et al., "Efficient Circuit Transpilation for QAOA via Parity Twine Chains" (arXiv:2505.17944). The same family of parity-propagation techniques was used by ParityQC and IBM for the 52-qubit Quantum Fourier Transform benchmark on Heron r3 (April 2026).
Validation
We verify two invariants: every logical pair (i, j) is brought adjacent exactly once across the n rounds — scripts/bench_parity_twine.py asserts this for n up to 120 — and the gate list decomposes to exactly 3 CNOTs + 1 Rz per inner brick, 2 CNOTs + 1 Rz per boundary Rzz, with the correct Rz angle. Solution quality is unchanged from the naive ansatz on small test cases (n=4, 8, 16) within COBYLA's parameter optimisation variance; the full end-to-end benchmark on dense MaxCut and portfolio instances will ship with v0.9.4.
What's next
Three near-term follow-ons. First, UnifiedSolver will auto-select PTC for dense problems — today the ansatz is a manual toggle. Second, a PTC variant for sparse graphs that skips bricks with zero coupling is straightforward and lands in v0.9.4. Third, the scheduling logic is decoupled from the CUDA-Q backend — it lifts cleanly to cuQuantum and eventually to hardware-facing targets when NEROX wires real QPU execution.