GPU Tail Latency Diagnosis for Serverless and HPC Workloads using eBPF

Serverless and elastic GPU platforms for AI inference face a critical observability challenge: diagnosing unpredictable tail latency spikes in multi-tenant cloud environments. In these settings, users operate within virtual machines without access to cluster-wide fabric or network infrastructure, lacking visibility into the underlying hardware topology and host-level resource contention. Traditional GPU-centric monitoring tools like NVML fail to capture the broader system dynamics that cause performance anomalies, leaving developers unable to debug production issues effectively.