
GPU Deployment

Self-host the full NEROX solver stack on your own NVIDIA hardware. Available on Business and Enterprise plans.

Prerequisites

You need at least one NVIDIA GPU with CUDA 12.1+ support. The NEROX solver image is tested on A100, H100, RTX 4090, and A10G. A minimum of 24 GB VRAM is recommended for production workloads with more than 10,000 variables.

Docker 24.0+
NVIDIA Container Toolkit (nvidia-docker2)
CUDA driver ≥ 525.60 (CUDA 12.1 compatible)
Linux x86_64 (Ubuntu 22.04 LTS recommended)
A NEROX Business or Enterprise license key
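
Before installing anything, it can help to confirm the driver requirement is met. The following is a hypothetical preflight sketch (not part of NEROX tooling); it assumes `nvidia-smi` is on PATH when a driver is installed:

```shell
#!/usr/bin/env bash
# Preflight: check the installed NVIDIA driver against the documented minimum.
set -euo pipefail

MIN_DRIVER="525.60"

# Returns success if dotted version $1 >= $2 (sort -V handles dotted versions)
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

if command -v nvidia-smi >/dev/null 2>&1; then
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
  if ver_ge "$driver" "$MIN_DRIVER"; then
    echo "OK: driver $driver meets minimum $MIN_DRIVER"
  else
    echo "FAIL: driver $driver is older than $MIN_DRIVER" >&2
  fi
else
  echo "SKIP: nvidia-smi not found (no NVIDIA driver installed?)"
fi
```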

Install NVIDIA Container Toolkit

bash
# Ubuntu / Debian
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Pull and run the solver container

bash
# Authenticate with the NEROX registry
echo $NEROX_LICENSE_KEY | docker login registry.driftrail.com -u license --password-stdin

# Pull the solver image
docker pull registry.driftrail.com/nerox-solver:latest

# Run with all GPUs
docker run -d \
  --name nerox-solver \
  --gpus all \
  --restart unless-stopped \
  -p 8080:8080 \
  -e NEROX_LICENSE_KEY=$NEROX_LICENSE_KEY \
  -e NEROX_LOG_LEVEL=info \
  registry.driftrail.com/nerox-solver:latest

Configure the Python client to use your instance

python
import nerox

client = nerox.Client(
    base_url="http://your-server:8080",
    api_key="nrx_sk_..."   # your API key, validated locally by the container
)

job = client.optimize.tsp(distance_matrix=matrix, solver="gpu")
result = job.wait()
print(result.objective)
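
The exact shape `optimize.tsp` expects for `distance_matrix` is not pinned down above; a common convention is a square, symmetric list of lists of pairwise distances. A minimal sketch building one from 2-D coordinates (`build_distance_matrix` is a local helper, not part of the `nerox` package):

```python
import math

def build_distance_matrix(points):
    """Build a square, symmetric Euclidean distance matrix from 2-D points."""
    n = len(points)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # Euclidean distance
            m[i][j] = m[j][i] = d
    return m
```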

Docker Compose (recommended for single-node)

yaml
# docker-compose.yml
version: "3.9"

services:
  nerox-solver:
    image: registry.driftrail.com/nerox-solver:latest
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      NEROX_LICENSE_KEY: ${NEROX_LICENSE_KEY}
      NEROX_LOG_LEVEL: info
      NEROX_MAX_CONCURRENT_JOBS: 4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - nerox-data:/var/nerox

  nerox-gateway:
    image: registry.driftrail.com/nerox-gateway:latest
    restart: unless-stopped
    ports:
      - "443:443"
    environment:
      UPSTREAM: http://nerox-solver:8080
      TLS_CERT: /certs/tls.crt
      TLS_KEY: /certs/tls.key
    volumes:
      - ./certs:/certs:ro

volumes:
  nerox-data:

Kubernetes deployment

GPU node pool

Label your GPU nodes and configure the NVIDIA device plugin before deploying. The NEROX solver is deployed as a StatefulSet, with each replica owning a dedicated GPU.

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nerox-solver
  namespace: nerox
spec:
  serviceName: nerox-solver
  replicas: 2
  selector:
    matchLabels:
      app: nerox-solver
  template:
    metadata:
      labels:
        app: nerox-solver
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-a100
      containers:
        - name: solver
          image: registry.driftrail.com/nerox-solver:latest
          ports:
            - containerPort: 8080
          env:
            - name: NEROX_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: nerox-license
                  key: key
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
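
A StatefulSet requires a governing headless Service (referenced by `serviceName`). A minimal sketch matching the labels and port above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nerox-solver
  namespace: nerox
spec:
  clusterIP: None   # headless: gives each replica a stable DNS identity
  selector:
    app: nerox-solver
  ports:
    - name: http
      port: 8080
```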

Environment variables reference

NEROX_LICENSE_KEY: Required. Your Business/Enterprise license key.
NEROX_LOG_LEVEL: debug | info | warn | error. Default: info.
NEROX_MAX_CONCURRENT_JOBS: Max jobs running in parallel per instance. Default: 2.
NEROX_DATA_DIR: Path for job result storage. Default: /var/nerox.
NEROX_TLS_CERT: Path to TLS certificate (optional, use the gateway instead).
NEROX_METRICS_PORT: Prometheus metrics endpoint port. Default: 9090.
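
The defaults above can be mirrored when scripting around the solver. A hypothetical helper sketching that resolution (the solver's own startup parsing may differ):

```python
# Mirrors the documented defaults; not part of the NEROX distribution.
DEFAULTS = {
    "NEROX_LOG_LEVEL": "info",
    "NEROX_MAX_CONCURRENT_JOBS": "2",
    "NEROX_DATA_DIR": "/var/nerox",
    "NEROX_METRICS_PORT": "9090",
}
VALID_LOG_LEVELS = {"debug", "info", "warn", "error"}

def resolve(env):
    """Apply documented defaults and validate a solver environment mapping."""
    if not env.get("NEROX_LICENSE_KEY"):
        raise ValueError("NEROX_LICENSE_KEY is required")
    cfg = {k: env.get(k, default) for k, default in DEFAULTS.items()}
    if cfg["NEROX_LOG_LEVEL"] not in VALID_LOG_LEVELS:
        raise ValueError(f"invalid NEROX_LOG_LEVEL: {cfg['NEROX_LOG_LEVEL']}")
    cfg["NEROX_MAX_CONCURRENT_JOBS"] = int(cfg["NEROX_MAX_CONCURRENT_JOBS"])
    cfg["NEROX_METRICS_PORT"] = int(cfg["NEROX_METRICS_PORT"])
    return cfg
```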

Health check and monitoring

bash
# Health endpoint
curl http://localhost:8080/health
# { "status": "ok", "gpus": 2, "jobs_running": 1 }

# Prometheus metrics (scraped by default at :9090/metrics)
# nerox_jobs_total, nerox_job_duration_seconds, nerox_gpu_utilization_percent

# View live GPU utilization
watch -n1 nvidia-smi
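
Deployment scripts can gate on the health endpoint. A minimal standard-library sketch (`is_ready` and `check` are hypothetical helpers; the response shape is taken from the example output above):

```python
import json
import urllib.request

def is_ready(payload: dict) -> bool:
    """Ready when the solver reports "ok" and at least one GPU is visible."""
    return payload.get("status") == "ok" and payload.get("gpus", 0) >= 1

def check(base_url: str = "http://localhost:8080") -> bool:
    """Fetch /health and evaluate readiness."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return is_ready(json.load(resp))
```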