Pol G. Recasens
UPC&BSC
Chen Wang
IBM
Yue Zhu
IBM
Jordi Torres
UPC&BSC
Josep Lluis Berral
UPC&BSC

Optimizing Inference for Small Models in Memory-Limited GPU Environments

Presentation: PDF

This presentation examines how to optimize inference for small models in memory-limited GPU environments. In contrast to conventional approaches that dedicate a GPU to a single model instance, we propose replicating a small model multiple times on a single GPU to improve serving efficiency. Our empirical studies support this approach and offer insights into inference strategies for small models in resource-constrained settings.
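As a rough illustration of the replication idea (a sketch, not the authors' implementation), the snippet below simulates serving requests across several co-located model replicas with round-robin dispatch; `NUM_REPLICAS` and `fake_infer` are hypothetical stand-ins for replicas sharing one GPU and a small model's forward pass.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

NUM_REPLICAS = 4  # hypothetical: replicas co-located on one GPU


def fake_infer(replica_id: int, prompt: str) -> str:
    # Stand-in for a small model's forward pass on replica `replica_id`.
    return f"replica{replica_id}:{prompt}"


def serve(prompts):
    # Round-robin dispatch: each incoming request is routed to the next
    # replica, and replicas handle requests concurrently (one worker
    # thread per replica in this toy version).
    rr = itertools.cycle(range(NUM_REPLICAS))
    with ThreadPoolExecutor(max_workers=NUM_REPLICAS) as pool:
        futures = [pool.submit(fake_infer, next(rr), p) for p in prompts]
        return [f.result() for f in futures]


results = serve(["a", "b", "c", "d", "e"])
print(results)
```

In a real deployment, each replica would hold its own copy of the model weights, so the scheme only pays off when the model is small enough that several copies fit in GPU memory at once.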