Elastic MIG Reconfiguration with PCIe-Aware Placement for Multi-Tenant GPUs

Multi-tenant LLM inference platforms face unpredictable tail latency from noisy-neighbor interference, violating service-level objectives (SLOs) critical for serverless and elastic cloud workloads. We present a VM-deployable controller that enables fine-grained, dynamic resource allocation by combining Multi-Instance GPU (MIG) reconfiguration, PCIe-aware placement, and lightweight isolation guardrails. The controller samples per-tenant tail latency (including time-to-first-token for autoregressive serving), correlates system signals to detect interference, and adaptively adjusts isolation using host-only controls deployable in cloud environments without fabric privileges. Evaluated on vLLM serving OLMo 2 7B Instruct across a 16-GPU cluster, our approach reduces SLO miss-rate by ≈32% and improves TTFT p99 by ≈10–15% with ≤5% throughput cost, demonstrating practical elastic GPU allocation for serverless LLM inference.
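
To make the two headline metrics concrete, the following is a minimal sketch of how per-tenant p99 TTFT and SLO miss rate could be computed from sampled request latencies. It is illustrative only and is not the controller described above: the function names `p99` and `tenant_report`, the `(tenant, ttft_ms)` input shape, the single `slo_ms` threshold, and the nearest-rank percentile definition are all assumptions.

```python
from collections import defaultdict


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list (assumed definition)."""
    ordered = sorted(samples)
    # ceil(0.99 * n) in integer arithmetic, as a 1-based rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def tenant_report(observations, slo_ms):
    """Group (tenant, ttft_ms) pairs; report p99 TTFT and SLO miss rate per tenant.

    A request counts as an SLO miss when its TTFT exceeds slo_ms
    (hypothetical single-threshold SLO).
    """
    by_tenant = defaultdict(list)
    for tenant, ttft_ms in observations:
        by_tenant[tenant].append(ttft_ms)

    report = {}
    for tenant, samples in by_tenant.items():
        misses = sum(1 for s in samples if s > slo_ms)
        report[tenant] = {
            "p99_ms": p99(samples),
            "miss_rate": misses / len(samples),
        }
    return report


if __name__ == "__main__":
    obs = [("a", 100.0)] * 99 + [("a", 500.0)] + [("b", t) for t in (10, 20, 30, 40)]
    for tenant, stats in sorted(tenant_report(obs, slo_ms=25).items()):
        print(tenant, stats)
```

Under these assumptions, a relative reduction in miss rate would be computed by comparing `miss_rate` for the same tenant with and without the controller enabled.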