This talk explores how IBM Cloud Code Engine redefines serverless computing by uniting scalability and simplicity for modern cloud workloads. As a fully managed, serverless platform, Code Engine enables developers to deploy container images, batch jobs, source code, and functions without the burden of infrastructure management or cluster sizing, allowing them to focus solely on delivering value to customers. The session will highlight how Code Engine’s unified experience supports a diverse range of workloads—from web applications and event-driven functions to large-scale, compute-intensive batch jobs—scaling seamlessly to meet fluctuating demand while offering a pay-for-what-you-use model.
A key focus will be the introduction of the new Serverless Fleets capability, designed to address the growing need to handle large-scale, compute-intensive workloads -- such as simulations, financial modeling, Monte Carlo methods, AI training, and batch inferencing -- without managing infrastructure. Leveraging serverless VMs or GPUs with automatic scaling, including scale-to-zero, IBM Cloud Code Engine optimizes resource use and minimizes costs. Customizable machine profiles, secure networking, and integrated storage let developers focus on coding while operational overhead is handled by the platform. The session will showcase use cases where serverless architectures simplify parallel, data-intensive computing and ensure scalable, cost-effective performance. Serverless Fleets extends Code Engine’s serverless paradigm to high-performance computing scenarios, enabling organizations to run massive simulations and data processing tasks with the same ease and efficiency as traditional serverless applications.
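As a rough illustration of the batch-job pattern the talk describes, here is a minimal Python worker for an array-style job, where each instance processes one partition of a Monte Carlo simulation. It assumes the platform exposes the instance's index through a JOB_INDEX environment variable (as Code Engine batch jobs do); the sample-count variable and seeding scheme are our own illustrative choices, not part of the talk.

```python
# Minimal sketch of a Monte Carlo worker for an array-style batch job.
# Assumes the platform passes this instance's index via the JOB_INDEX
# environment variable; adjust the variable name for your setup.
import os
import random

def estimate_pi(samples: int) -> float:
    """Estimate pi by sampling points in the unit square."""
    hits = sum(1 for _ in range(samples)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

if __name__ == "__main__":
    index = int(os.environ.get("JOB_INDEX", "0"))      # which partition this instance handles
    samples = int(os.environ.get("SAMPLES", "1000000"))
    random.seed(index)                                  # decorrelate partitions
    print(f"partition {index}: pi ~= {estimate_pi(samples):.6f}")
```

Running many such instances in parallel, with the platform scaling the pool up for the job and back to zero afterwards, is the kind of compute-intensive, pay-per-use workload Serverless Fleets targets.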
Slides [PDF]
Jeremias Werner, STSM, IBM Cloud
The increasing scale of scientific data necessitates cloud computing, yet the operational complexity of cloud infrastructure and parallel frameworks like Dask and Lithops hinders research productivity. To overcome this, we present PyRun, an integrated platform that abstracts the entire cloud infrastructure lifecycle, enabling scientists to run scalable Python workflows effortlessly.
PyRun provides a unified, web-based IDE where users can code, configure, and execute jobs on their own cloud account. It automates the creation of execution environments from simple dependency files and offers seamless, UI-driven integration with Dask and Lithops. Performance benchmarks demonstrate PyRun's efficiency, showing up to 14.9x lower cost and 5.4x faster execution for Function-as-a-Service (FaaS) workloads, and 1.6x faster end-to-end execution for Dask cluster tasks compared to contemporary solutions. By unifying development and execution, PyRun significantly lowers the barrier to entry for high-performance computing, allowing researchers to focus on science, not cloud infrastructure.
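Since PyRun drives Lithops under the hood, the sketch below shows the kind of map-style FaaS workload it dispatches, using the standard Lithops executor API. It assumes a Lithops backend has already been configured (which is the part PyRun automates); the function and data are illustrative.

```python
# Sketch of a map-style FaaS workload using the Lithops API that PyRun integrates with.
# Assumes a Lithops backend (e.g. a cloud FaaS provider) is already configured.
import lithops

def square(x):
    return x * x

if __name__ == "__main__":
    fexec = lithops.FunctionExecutor()              # picks up the backend from the Lithops config
    futures = fexec.map(square, list(range(100)))   # fan out one function invocation per item
    print(sum(fexec.get_result(futures)))           # gather results: 0^2 + 1^2 + ... + 99^2
```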
Slides [PDF]
Daniel Alejandro Coll Tejeda
Job scheduling for machine learning (ML) has received significant attention, targeting objectives such as job completion time, utilization, and fairness through techniques such as heterogeneity, elasticity, and task merging. However, the quality of the configuration space in which these scheduling policies make decisions and choose between GPU placements and hyperparameters for jobs is under-explored. We find that performance between configurations for the same ML fine-tuning job can differ by over 40x, and, as a result, schedulers with under-explored views of the configuration space cannot find the configurations that optimize their objective, regardless of the quality of the scheduling policy. Hence, we make a case that configuration knowledge should be treated as a first-class citizen in ML schedulers. We propose CoTune, a scheduling framework for ML fine-tuning that exposes various configuration knowledge classes, qualities, and quantities for scheduling policies to integrate and evaluate. As a foundation, we propose a methodology for defining, gathering, and predicting configuration knowledge, which we apply to build a comprehensive configuration knowledge database. Using production traces on a simulated GPU cluster, we demonstrate how full configuration knowledge from our database reduces average and tail job completion time for fine-tuning jobs by 73.9% and 79.0% across policies while decreasing average GPU utilization by 27.9%.
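To make the idea of configuration knowledge concrete, here is a hypothetical Python sketch of how a scheduler might consult a small knowledge table mapping (GPU placement, hyperparameter) configurations to throughput and pick the best feasible one per job. The table contents, job names, and fields are illustrative and are not CoTune's actual data model.

```python
# Hypothetical sketch: a scheduler consults configuration knowledge to pick,
# per fine-tuning job, the feasible configuration with the best known throughput.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    gpus: int          # GPU placement (number of GPUs, single node for simplicity)
    batch_size: int    # hyperparameter affecting throughput and memory

# knowledge: job type -> {configuration: measured or predicted samples/second}
knowledge = {
    "llm-finetune": {
        Config(1, 8): 120.0,
        Config(2, 16): 400.0,
        Config(4, 32): 900.0,
    }
}

def best_config(job_type: str, free_gpus: int) -> Config:
    """Pick the feasible configuration with the highest known throughput."""
    candidates = {c: tput for c, tput in knowledge[job_type].items() if c.gpus <= free_gpus}
    return max(candidates, key=candidates.get)

print(best_config("llm-finetune", free_gpus=2))  # Config(gpus=2, batch_size=16)
```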
Slides [PDF]
Matthijs Jansen
Serverless computing offers promising scalability and cost-efficiency, yet constructing performant DAGs from monolithic applications remains challenging. We present AutoDAG, an end-to-end automated platform for transforming monolithic code into optimized serverless DAGs across federated clouds. AutoDAG integrates three key components—FaaSifier, Profiler, and Scheduler—to automate function decomposition, data flow analysis, and resource-aware task allocation. Our approach supports profiling for parallelism and dependency inference, reducing function count and data transfers. Evaluated on the Montage workflow in AWS, AutoDAG achieves up to 12.3× speedup and significant cost reduction compared to state-of-the-art tools.
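For intuition, the sketch below shows the kind of task graph an AutoDAG-style tool could emit for a Montage-like workflow: decomposed functions become tasks and edges capture inferred data dependencies, so independent tasks can be fanned out to serverless functions stage by stage. The task names and structure are hypothetical, not AutoDAG's actual output format.

```python
# Illustrative serverless DAG: tasks with inferred data-flow dependencies,
# executed in topological order with independent tasks fanned out in parallel.
from graphlib import TopologicalSorter

dag = {
    "project_images": set(),                        # no dependencies: runs first, per input tile
    "compute_overlaps": {"project_images"},
    "background_correction": {"compute_overlaps"},
    "coadd_mosaic": {"background_correction"},
}

for task in TopologicalSorter(dag).static_order():
    print("schedule:", task)
```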
Slides [PDF]
Philipp Gritsch
Resource provisioning in serverless data analytics is critical for performance, with the number of functions per stage directly affecting execution latency. We investigate new mechanisms for resource provisioning in serverless applications. Specifically, we develop an analytical model that reduces the profiling time and cost required to model a job while optimizing its execution. We test our approach on a production-ready metabolomics pipeline and demonstrate execution time improvements over state-of-the-art serverless resource provisioning methods.
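As a hedged sketch of why the number of functions per stage matters, the toy model below treats stage latency as a fixed startup cost plus perfectly divisible work plus a per-function coordination term, then picks the parallelism that minimizes it. The model form and parameter values are our own illustration (normally fitted from a few profiling runs), not the authors' actual model.

```python
# Toy analytical model: latency(n) = startup + work/n + shuffle*n,
# minimized over the number of functions per stage.
def stage_latency(n_functions: int, startup_s: float, work_s: float, shuffle_s: float) -> float:
    return startup_s + work_s / n_functions + shuffle_s * n_functions

def best_parallelism(startup_s=2.0, work_s=600.0, shuffle_s=0.05, max_n=512) -> int:
    return min(range(1, max_n + 1),
               key=lambda n: stage_latency(n, startup_s, work_s, shuffle_s))

print(best_parallelism())  # ~110 functions minimizes the modeled stage latency
```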
Slides [PDF]
Germán T. Eizaguirre
Serverless applications consist of functions written in heterogeneous programming languages, use diverse data stores and communication services, and evolve rapidly. Consequently, it is challenging for serverless tenants to protect their application data from inadvertent leaks due to bugs, misconfigurations, and human errors. Cloud security tools, such as Identity and Access Management (IAM), lack observability into a tenant’s application, whereas the state-of-the-art dataflow tracking tools require support from the cloud platform and incur significant runtime overheads.

We present Growlithe, a tool that integrates with the serverless application development toolchain and enables continuous compliance with data policies by design. Growlithe allows declarative specification of access and data flow control policies over a language- and platform-independent dataflow graph abstraction of a serverless application, and enforces these policies through a combination of static analysis and runtime enforcement.

We used Growlithe with applications using Python and JavaScript functions that can be hosted on AWS Lambda and Google Cloud Functions platforms. We empirically demonstrate that Growlithe is cross-cutting, portable and efficient, and enables developers to easily adapt their applications and policies to evolving requirements.
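To illustrate the flavor of declarative flow control over a dataflow graph, here is a hypothetical Python sketch: functions and data stores are graph nodes, a policy lists permitted edges, and a runtime guard rejects any flow not in the policy. The policy syntax, node names, and enforcement hook are ours for illustration; Growlithe's actual policy language and enforcement mechanism differ.

```python
# Hypothetical flow policy over a dataflow graph: nodes are functions and data
# stores; only listed (source, sink) edges are permitted at runtime.
ALLOWED_FLOWS = {
    ("upload_handler", "s3://user-uploads"),
    ("s3://user-uploads", "thumbnail_fn"),
    ("thumbnail_fn", "s3://public-thumbnails"),
}

def check_flow(source: str, sink: str) -> None:
    """Runtime guard: refuse any data flow the policy does not list."""
    if (source, sink) not in ALLOWED_FLOWS:
        raise PermissionError(f"policy violation: {source} -> {sink}")

check_flow("thumbnail_fn", "s3://public-thumbnails")     # permitted
# check_flow("upload_handler", "s3://public-thumbnails") # would raise: not in policy
```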
Slides [PDF]
Praveen Gupta
Cloud and edge computing, particularly serverless Function-as-a-Service (FaaS), offer scalable resources for IoT devices. This research presents an adaptive offloading framework that intelligently selects the best execution environment using pretrained cost estimation models for each function and FaaS platform. These models guide offloading decisions, enabling IoT devices to optimize function execution times. Results show that pretraining significantly reduces calibration time and improves adaptability to changing network conditions. The proposed approach demonstrates improved performance by dynamically selecting the most efficient environment per invocation, ensuring faster, more reliable execution even under network degradation.
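The decision logic described above can be sketched as follows: per candidate target, a pretrained cost model predicts execution time from current conditions, and the device picks the cheapest target for each invocation. The linear models, target names, and numbers below are purely illustrative stand-ins for the framework's pretrained estimators.

```python
# Hedged sketch of adaptive offloading: predict per-target latency, pick the minimum.
def predict_latency(target: str, payload_kb: float, bandwidth_mbps: float) -> float:
    # toy cost models standing in for the pretrained estimators
    transfer = 0.0 if target == "local" else payload_kb * 8 / (bandwidth_mbps * 1000)
    compute = {"local": 2.5, "edge": 0.6, "cloud": 0.3}[target]   # seconds
    return transfer + compute

def choose_target(payload_kb: float, bandwidth_mbps: float) -> str:
    return min(["local", "edge", "cloud"],
               key=lambda t: predict_latency(t, payload_kb, bandwidth_mbps))

print(choose_target(payload_kb=500, bandwidth_mbps=50))   # good network: offload to the cloud
print(choose_target(payload_kb=500, bandwidth_mbps=0.5))  # degraded network: run locally
```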
Slides [PDF]
Tomasz Szydlo
Fine-grained serverless functions power many new applications that benefit from elastic scaling and a pay-as-you-use billing model with minimal infrastructure management overhead. To achieve these properties, Function-as-a-Service (FaaS) platforms disaggregate compute and state and, consequently, introduce non-trivial costs due to the loss of data locality when accessing state, complex control plane interactions, and expensive inter-function communication. We revisit the foundations of FaaS and propose a new cloud abstraction, the cloud process, that retains all the benefits of FaaS while significantly reducing the overheads that result from disaggregation. We show how established operating system abstractions can be adapted to provide powerful granular computing on dynamically provisioned cloud resources while building our Process as a Service (PraaS) platform. PraaS improves on current FaaS by offering data locality, fast invocations, and efficient communication. PraaS delivers remote invocations up to 17× faster and reduces communication overhead by up to 99%.
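A rough way to picture the data-locality argument is the contrast below: a stateless function re-fetches its state from disaggregated storage on every invocation, whereas a process keeps state resident across invocations. The class is our own sketch of the idea, not the PraaS API.

```python
# Illustrative contrast: a "cloud process" keeps state with the compute across
# invocations instead of reloading it from remote storage each time.
class CloudProcess:
    def __init__(self):
        self.cache = {}                  # process-local state survives across invocations

    def invoke(self, key: str, load_fn):
        if key not in self.cache:        # only the first invocation pays the load cost
            self.cache[key] = load_fn()
        return self.cache[key]

proc = CloudProcess()
print(proc.invoke("model", lambda: "expensive remote load"))  # loads once
print(proc.invoke("model", lambda: "expensive remote load"))  # served from local state
```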
Slides [PDF]
Marcin Copik
Application users react negatively to performance regressions or availability issues across software releases. To address this, modern cloud-based applications with their multiple daily releases rely on live testing techniques such as A/B testing or canary releases. In edge-to-cloud applications, however, which have similar problems, developers currently still have to hard-code custom live testing tooling, as there is no general framework for edge-to-cloud live testing. With Umbilical Choir, we partially close this gap for serverless edge-to-cloud applications. Umbilical Choir is compatible with all Function-as-a-Service platforms and extensively supports various live testing techniques, including canary releases with various geo-aware strategies, A/B testing, and gradual roll-outs. We evaluate Umbilical Choir through a complex release scenario showcasing various live testing techniques in a mixed edge-cloud deployment and discuss different geo-aware strategies.
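The core mechanism behind a canary release or gradual roll-out is weighted routing between function versions, sketched below: a small and gradually increasing share of invocations is sent to the new version while the rest stays on the stable one. Version names and weights are illustrative, not Umbilical Choir's configuration.

```python
# Sketch of weighted routing for a canary release / gradual roll-out.
import random

def route(canary_weight: float) -> str:
    """Return which function version should serve this invocation."""
    return "v2-canary" if random.random() < canary_weight else "v1-stable"

# gradual roll-out: 5% -> 25% -> 100% as health checks pass
for weight in (0.05, 0.25, 1.0):
    sample = [route(weight) for _ in range(10_000)]
    print(f"weight {weight:.0%}: {sample.count('v2-canary') / len(sample):.1%} on canary")
```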
Slides [PDF]
Mohammadreza Malekabbasi
This talk explores cloud-native, semi-serverless management of scientific workflows using Kubernetes. It highlights leveraging Kubernetes for resource management, scalability, workload scheduling, and cloud-agnostic deployment to handle large-scale scientific workflows comprising many tasks. The proposed solution employs scalable worker pools with horizontal scaling based on queue size and vertical scaling via dynamic CPU/memory adjustments using the Vertical Pod Autoscaler (VPA). The talk includes a case study on the Montage workflow, comparing execution with and without worker pools, as well as with vertical scaling, on the GKE cloud, demonstrating improved performance and resource efficiency.
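A minimal sketch of the queue-size-based horizontal scaling rule described above: target roughly one worker per fixed batch of queued tasks, clamped to the pool's limits, with an empty queue scaling the pool down to zero. The threshold and bounds are illustrative assumptions, not the talk's exact configuration.

```python
# Queue-length-driven scaling rule for a worker pool (illustrative values).
def desired_workers(queue_length: int, tasks_per_worker: int = 10,
                    min_workers: int = 0, max_workers: int = 100) -> int:
    want = -(-queue_length // tasks_per_worker)   # ceiling division
    return max(min_workers, min(max_workers, want))

print(desired_workers(0))     # 0   -> pool scales to zero when the queue is empty
print(desired_workers(37))    # 4   -> ceil(37 / 10)
print(desired_workers(5000))  # 100 -> clamped at the pool maximum
```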
Slides [PDF]
Bartosz Balis