## Prerequisites

You need a Kubernetes cluster with a GPU node pool and the NVIDIA device plugin installed. NEROX is tested on GKE (NVIDIA A100 node pools), EKS (p4d instances), and AKS (NC A100 v4-series).
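The manifests in this guide assume a `nerox` namespace and a `nerox-secrets` Secret holding the license key under the `license-key` key. A minimal sketch — the license value shown is a placeholder you must replace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nerox
---
apiVersion: v1
kind: Secret
metadata:
  name: nerox-secrets
  namespace: nerox
type: Opaque
stringData:
  # Placeholder -- substitute your actual NEROX license key.
  license-key: YOUR-LICENSE-KEY
```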
## Deploy the solver StatefulSet

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nerox-solver
  namespace: nerox
spec:
  replicas: 2
  serviceName: nerox-solver
  selector:
    matchLabels:
      app: nerox-solver
  template:
    metadata:
      labels:
        app: nerox-solver
    spec:
      containers:
        - name: solver
          image: registry.driftrail.com/nerox-solver:latest
          ports:
            - containerPort: 8080
            - containerPort: 9090 # metrics
          env:
            - name: NEROX_LICENSE_KEY
              valueFrom:
                secretKeyRef:
                  name: nerox-secrets
                  key: license-key
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "96Gi"
            requests:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 10
```

## Service and Ingress
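A note on the StatefulSet's `serviceName: nerox-solver`: a StatefulSet's `serviceName` is expected to reference a *headless* Service (`clusterIP: None`) that gives each replica a stable DNS name, and the ClusterIP Service below doesn't fill that role. One common pattern is a separate headless Service; the `nerox-solver-headless` name here is my assumption, and if you adopt it the StatefulSet's `serviceName` must be updated to match:

```yaml
# Hypothetical governing headless Service for the StatefulSet; if you use it,
# change the StatefulSet's serviceName to "nerox-solver-headless".
apiVersion: v1
kind: Service
metadata:
  name: nerox-solver-headless
  namespace: nerox
spec:
  clusterIP: None # headless: pods get stable DNS, e.g. nerox-solver-0.nerox-solver-headless.nerox.svc
  selector:
    app: nerox-solver
  ports:
    - name: api
      port: 8080
```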
```yaml
apiVersion: v1
kind: Service
metadata:
  name: nerox-solver
  namespace: nerox
  labels:
    app: nerox-solver # the ServiceMonitor below selects the Service by this label
spec:
  selector:
    app: nerox-solver
  ports:
    - name: api
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nerox-ingress
  namespace: nerox # must live in the same namespace as the backend Service
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx # the nginx annotations above imply the NGINX ingress controller
  rules:
    - host: solver.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: nerox-solver
                port:
                  number: 80
```

## Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nerox-solver
  namespace: nerox
spec:
  selector:
    matchLabels:
      app: nerox-solver
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```

Track `nerox_jobs_total`, `nerox_job_duration_seconds`, and `nerox_gpu_utilization_percent` in Grafana. Alert when GPU utilization drops below 60% (underloaded) or when job queue depth exceeds your SLA limit.
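The alerts above can be sketched as a PrometheusRule, assuming the Prometheus Operator is installed. The GPU metric name comes from this guide, but the queue-depth metric name `nerox_job_queue_depth`, the 100-job threshold, and the `for:` durations are my assumptions — adjust them to your workload and SLA:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nerox-solver-alerts
  namespace: nerox
spec:
  groups:
    - name: nerox-solver
      rules:
        - alert: NeroxGPUUnderloaded
          expr: avg(nerox_gpu_utilization_percent) < 60
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "NEROX GPU utilization below 60% for 15m (underloaded)"
        - alert: NeroxQueueDepthHigh
          # nerox_job_queue_depth and the 100-job threshold are assumptions.
          expr: max(nerox_job_queue_depth) > 100
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "NEROX job queue depth exceeds SLA limit"
```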
