Scaling up spatial metabolomics with serverless computing and Lithops


Spatial metabolomics is an emerging omics field used in biology and medicine to map the localization of small molecules, metabolites, and lipids in tissue sections. Understanding the spatial organization of metabolic reactions and processes in an organism or organ is critically important, in particular for elucidating the metabolic reprogramming of cells in health and in diseases such as cancer, diabetes, or immune disease. We have developed a cloud solution for spatial metabolomics, called METASPACE, that integrates an engine for finding molecules in big spatial metabolomics data with a web app for accessing, visualizing, sharing, and publishing the results (http://metaspace2020.eu). The original version of METASPACE was implemented using Apache Spark (https://github.com/metaspace2020). Currently, METASPACE is used by over 800 scientists worldwide, who submit 10-100 datasets a day. Datasets range from 2 MB to 64 GB in size, with most falling between 64 MB and 2 GB. Moreover, the METASPACE workload is highly irregular, owing both to differing patterns of use and to batch submissions through the Python API. Together, these factors created a strong demand for scalability.

To address this demand, we implemented METASPACE using the serverless framework Lithops (https://github.com/metaspace2020/Lithops-METASPACE). Lithops is an open-source, multi-cloud distributed computing framework for Python developed in the EU project CloudButton (https://github.com/lithops-cloud/lithops). We benchmarked Lithops-METASPACE against our Apache Spark version of METASPACE on five datasets of different sizes. For Spark, we used one master node and three worker nodes, each running on a c5.4xlarge instance (16 vCPU, 32 GB RAM) on AWS. For Lithops, we used Cloud Functions with RAM ranging from 256 MB to 4 GB and one instance with 128 GB of RAM on IBM Cloud.

Our comparison showed that the serverless implementation outperformed the Apache Spark implementation. The Lithops implementation was ~2 times faster for very small datasets, ~20%-50% faster for medium datasets, and slightly faster or comparable for large datasets, mainly because starting containers in Lithops takes less time than starting the Apache Spark cluster. Lithops also made resource management more convenient: the number of parallel executors and the amount of allocated RAM can be set by changing the respective variables in the code or configuration file. This is one of the benefits of Lithops, which is designed to relieve the developer from the need to create cloud functions with specified characteristics (CPU / RAM) in advance. We evaluated the impact of the number of executors on the runtime for a large dataset (21 GB). For 39, 78, 157, 315, and 788 executors, the runtimes were 66, 35, 20, 17, and 13 seconds, respectively. As expected, increasing the number of executors reduced the runtime. The speedup is close to linear for 39, 78, and 157 executors, whereas for larger numbers of executors it is less than linear because of the per-executor overhead of code initialization and of downloading data from and storing data to the object storage.
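The scaling behaviour described above can be checked with a few lines of arithmetic. The following is a minimal sketch that takes the executor counts and runtimes exactly as reported in the text and computes the speedup and parallel efficiency relative to the 39-executor run. The function name `scaling_table` and the choice of the smallest run as the baseline are assumptions made for illustration, not part of the METASPACE code.

```python
# Executor counts and runtimes as reported in the text for the 21 GB dataset.
executors = [39, 78, 157, 315, 788]
runtimes = [66, 35, 20, 17, 13]


def scaling_table(executors, runtimes):
    """Return (executors, speedup, efficiency) tuples.

    Speedup is measured against the first (smallest) run; efficiency is
    the speedup divided by the ideal linear speedup for that executor count.
    """
    base_n, base_t = executors[0], runtimes[0]
    rows = []
    for n, t in zip(executors, runtimes):
        speedup = base_t / t   # how much faster than the baseline run
        ideal = n / base_n     # speedup expected under perfect scaling
        rows.append((n, speedup, speedup / ideal))
    return rows


rows = scaling_table(executors, runtimes)
for n, speedup, efficiency in rows:
    print(f"{n:4d} executors: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

With these numbers, efficiency stays at roughly 0.94 and 0.82 for 78 and 157 executors and drops to about 0.48 and 0.25 for 315 and 788 executors, which matches the near-linear and then sub-linear regimes described in the text.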

Overall, we found Lithops to be a competitive solution and use it in the production deployment of METASPACE on IBM Cloud. In terms of development time, the ease of local development and debugging was a clear benefit, since Lithops allows the same lightweight Docker images to be used both for local development and for various types of cloud backends (Cloud Functions, Code Engine, VPC). Although the functional paradigm of Lithops requires some training for developers, we did not face any major bottlenecks. The ability to use different backends depending on the type of task and the amount of required RAM enabled us to plan resources while preserving the flexibility needed in science. Finally, as a cloud-agnostic framework, Lithops makes it possible to deploy code on whichever cloud provides the best serverless performance and convenience.
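To illustrate the functional, map-style paradigm mentioned above, here is a minimal sketch using only the Python standard library. It is not Lithops code and not the METASPACE pipeline: `ThreadPoolExecutor` stands in for a serverless executor (Lithops exposes a comparable `map` call on its `FunctionExecutor`), and the task `count_peaks`, the helper names, and the synthetic data are all hypothetical. The point is that the number of parallel executors is a single parameter, and the work is expressed as a pure function applied to independent chunks.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    size, rem = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rem else 0)
        if end > start:  # skip empty chunks when there are few items
            chunks.append(items[start:end])
        start = end
    return chunks


def count_peaks(chunk, threshold=0.5):
    """Hypothetical per-chunk task: count intensities above a threshold."""
    return sum(1 for value in chunk if value > threshold)


def run_map(items, n_executors):
    """Map count_peaks over chunks in parallel, then reduce by summing."""
    chunks = split_into_chunks(items, n_executors)
    with ThreadPoolExecutor(max_workers=n_executors) as pool:
        partial_counts = list(pool.map(count_peaks, chunks))
    return sum(partial_counts)


# Synthetic intensities 0.00, 0.01, ..., 0.99 (illustrative data only).
intensities = [i / 100 for i in range(100)]
total = run_map(intensities, n_executors=4)
print(total)
```

Because each chunk is processed independently and the results are combined only at the end, changing `n_executors` changes the degree of parallelism without changing the result.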