## Prerequisites

You need a Kubernetes cluster with a GPU node pool and the NVIDIA device plugin installed. NEROX is tested on GKE (NVIDIA A100 node pools), EKS (p4d instances), and AKS (NC A100 v4-series).
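The manifests in this guide assume a `nerox` namespace and a `nerox-secrets` Secret holding the license key under the `license-key` key. A minimal sketch — the license value shown is a placeholder you must replace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nerox
---
apiVersion: v1
kind: Secret
metadata:
  name: nerox-secrets
  namespace: nerox
type: Opaque
stringData:
  # Placeholder -- substitute your actual NEROX license key.
  license-key: YOUR-LICENSE-KEY
```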
## Deploy the solver StatefulSet

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nerox-solver
  namespace: nerox
spec:
  replicas: 2
  serviceName: nerox-solver
  selector:
    matchLabels:
      app: nerox-solver
  template:
    metadata:
      labels:
        app: nerox-solver
    spec:
      containers:
        - name: solver
          image: registry.driftrail.com/nerox-solver:latest
          ports:
            - containerPort: 8080
            - containerPort: 9090 # metrics
          env:
            - name: NEROX_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: nerox-secrets
                  key: license-key
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "96Gi"
            requests:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
```

## Service and Ingress
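A note on the StatefulSet's `serviceName: nerox-solver`: a StatefulSet's `serviceName` is expected to reference a *headless* Service (`clusterIP: None`) that gives each replica a stable DNS name, and the ClusterIP Service below doesn't fill that role. One common pattern is a separate headless Service; the `nerox-solver-headless` name here is my assumption, and if you adopt it the StatefulSet's `serviceName` must be updated to match:

```yaml
# Hypothetical governing headless Service for the StatefulSet; if you use it,
# change the StatefulSet's serviceName to "nerox-solver-headless".
apiVersion: v1
kind: Service
metadata:
  name: nerox-solver-headless
  namespace: nerox
spec:
  clusterIP: None # headless: pods get stable DNS, e.g. nerox-solver-0.nerox-solver-headless.nerox.svc
  selector:
    app: nerox-solver
  ports:
    - name: api
      port: 8080
```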
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nerox-solver
  namespace: nerox
  labels:
    app: nerox-solver # the ServiceMonitor below selects the Service by this label
spec:
  selector:
    app: nerox-solver
  ports:
    - name: api
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nerox-ingress
  namespace: nerox # must live in the same namespace as the backend Service
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx # the nginx annotations above imply the NGINX ingress controller
  rules:
    - host: solver.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nerox-solver
                port:
                  number: 80
```

## Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nerox-solver
  namespace: nerox
spec:
  selector:
    matchLabels:
      app: nerox-solver
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```

Track `nerox_jobs_total`, `nerox_job_duration_seconds`, and `nerox_gpu_utilization_percent` in Grafana. Alert when GPU utilization drops below 60% (underloaded) or when job queue depth exceeds your SLA limit.
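The alerts above can be sketched as a PrometheusRule, assuming the Prometheus Operator is installed. The GPU metric name comes from this guide, but the queue-depth metric name `nerox_job_queue_depth`, the 100-job threshold, and the `for:` durations are my assumptions — adjust them to your workload and SLA:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nerox-solver-alerts
  namespace: nerox
spec:
  groups:
    - name: nerox-solver
      rules:
        - alert: NeroxGPUUnderloaded
          expr: avg(nerox_gpu_utilization_percent) < 60
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "NEROX GPU utilization below 60% for 15m (underloaded)"
        - alert: NeroxQueueDepthHigh
          # nerox_job_queue_depth and the 100-job threshold are assumptions.
          expr: max(nerox_job_queue_depth) > 100
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "NEROX job queue depth exceeds SLA limit"
```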
