Manifesting the Elasticity of Serverless Data Pipelines: A Metabolomics Use Case

Cloud functions offer a powerful tool for data processing by freeing developers from manual resource provisioning and enabling fine-grained, on-demand scaling. This model is well suited to data analytics pipelines with dynamic resource requirements, yet demonstrations on complex, production-grade systems that exhibit abrupt, large-scale changes in parallelism remain scarce. This paper addresses this gap with a holistic performance analysis of the METASPACE metabolite annotation pipeline, a real-world, multi-stage data processing workflow. The pipeline is implemented in a purely serverless environment using the Lithops framework on AWS Lambda and is managed via PyRun; we examine its scaling behavior across varying input data sizes. We confirm that the pipeline exhibits strong elasticity, dynamically adjusting concurrency by orders of magnitude to match workload demands: it scales out to hundreds of functions for data-intensive stages while remaining modest for lighter tasks. Furthermore, we optimize the distributed data ingestion stage, identifying an optimal data chunk size that yields consistent per-stage speedups of up to 1.22x across various datasets. This work serves as a case study that validates the power and adaptability of cloud functions for demanding workflows and provides an example of elasticity that moves beyond simpler microbenchmarks.