Aurora González Vidal
University of Murcia
Alexander Isenko
Technical University of Munich
K. R. Jayaram
IBM Research, NY, USA

On Serving Image Classification Models

This paper aims to optimize model inference in interactive applications by reducing infrastructure costs. It seeks to improve resource utilization, lower costs, and enhance the scalability and responsiveness of model-serving systems. The focus is on achieving efficient inference in computer vision, but the approach has potential applications in other domains. The study uses experiments on a single GPU to analyze how input image size and mini-batch size affect request delivery time for image classification. Key findings include a model that estimates GPU warm-up time from four parameters, confirmation that inference time grows linearly with mini-batch size for a given model, and the need to account for input size when selecting the mini-batch size to avoid GPU crashes. Additionally, two mathematical models are proposed for further exploration with optimization algorithms. We also motivate the need for a more comprehensive mathematical model for soft and relaxed inference model serving.
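
To make the linearity claim concrete, the following is a minimal sketch of the kind of measurement involved, assuming a CUDA-capable GPU with PyTorch and torchvision installed. The model choice (ResNet-50), the batch sizes, and the timing loop are illustrative assumptions, not the paper's exact benchmark: it times steady-state inference across mini-batch sizes and fits t(b) ≈ α + β·b.

```python
import time

import numpy as np
import torch
import torchvision.models as models

device = torch.device("cuda")
model = models.resnet50(weights=None).eval().to(device)

def measure_latency(batch_size, image_size=224, repeats=10):
    """Average steady-state inference latency (seconds) for one mini-batch."""
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    with torch.no_grad():
        model(x)                  # warm-up call; first-call overhead is much larger
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        torch.cuda.synchronize()  # drain queued GPU kernels before stopping the clock
    return (time.perf_counter() - start) / repeats

batch_sizes = [1, 2, 4, 8, 16, 32]
latencies = [measure_latency(b) for b in batch_sizes]

# Fit t(b) = alpha + beta * b; small residuals support the linear relationship.
beta, alpha = np.polyfit(batch_sizes, latencies, 1)
print(f"t(b) ≈ {alpha * 1e3:.2f} ms + {beta * 1e3:.2f} ms * b")
```

Note that large batch sizes combined with large input images can exhaust GPU memory, which is the crash scenario the abstract warns about when the two parameters are chosen independently.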