Talks

Papers accepted to WoAIS1 (First International Workshop on AI and Serverless, 2026). See the program for the full schedule.

Papers session 1

#1 - LLM-Assisted Interpretation of Graphene Image Analysis in Serverless Workflows

Large language models (LLMs) offer new opportunities for interpreting the outputs of scientific data pipelines, but their deployment in serverless environments is hindered by cold-start latency and model loading costs. We present a serverless workflow for graphene microscopy analysis that combines deterministic computer vision with LLM-based interpretation. The workflow first extracts structured descriptors from microscopy images, and then uses an LLM to generate proxy concern summaries and explanations from these descriptors. We compare four interpretation strategies: a deterministic rule-based baseline, a tiny LLM, a small 3B-parameter LLM, and a progressive two-stage pipeline that uses the small model for initial batch-level interpretation and a larger 7B model for refinement. Using a case-study set of descriptors from graphene and graphene-oxide analysis workflows intended to support downstream health and environmental hazard assessment, we evaluate latency, time-to-first-answer, output stability, and refinement behavior under warm and cold execution. We additionally ablate the choice of phase-1 model size in the progressive pipeline, finding that an under-capacity phase-1 model collapses selective refinement into always-on refinement: phase 1 becomes a token routing layer, and total latency reverts to that of the larger model. The results indicate that descriptor-based, progressively served LLM interpretation is a promising design pattern for responsive and explainable AI-assisted scientific workflows, where structured summaries can support subsequent domain-expert review.

Josef Spillner (Zurich University of Applied Sciences); Sepideh Shamsizadeh (remote); Valerio Schiavoni (in person)


#2 - InferenceProfiler

The widespread deployment of Large Language Models (LLMs) requires substantial computational resources, often resulting in high operational costs and energy consumption. Optimizing these complex workloads necessitates precise visibility into how hardware accelerators and host system resources are utilized. This paper presents InferenceProfiler, a software tool that measures and records the resource usage of LLM inferencing tasks. InferenceProfiler is designed to profile the CPU/GPU, memory, disk, network, and power utilization of various tasks, collecting system metrics from the virtual machine (host), cgroup, process, GPU, and inference engine. InferenceProfiler leverages the NVIDIA Management Library (NVML) to capture real-time metrics across varying NVIDIA GPU architectures. InferenceProfiler integrates with vLLM, an open-source high-throughput inference engine for serving LLMs, to correlate application-level telemetry directly with underlying metrics. InferenceProfiler supports time-series profiling to enable continuous monitoring of the resources consumed by LLM inference.
Using InferenceProfiler, we demonstrate resource requirements for LLM inferencing on alternate GPUs by benchmarking Llama-3.2-3B-Instruct on six AWS EC2 accelerated computing instance types. With a load of 1,000 concurrent inference requests, we find that the g5.xlarge instance (A10G) achieves the lowest cost per generated token ($0.0936 per run, $0.1829 per million tokens) despite being neither the least expensive EC2 GPU instance per hour nor the fastest. Additionally, the g7e.2xlarge instance (RTX 6000 Blackwell) is the fastest but 3% more expensive per token than the g6e.xlarge and 8% more costly than the g5.xlarge.

Austin Bomhold, Xinghan Chen, Morteza Nabavinejad, Wes Lloyd (University of Washington Tacoma)


Papers session 2

#3 - Benchmarking Serverless AI Architectures: Modular RAG, Serverless RAG, and Long Context Inference

The expansion of context windows in large language models (LLMs) raises a critical question: Can long-context inference fully replace Retrieval-Augmented Generation (RAG) in production? To address this, we benchmarked three serverless architectures—Modular RAG, Serverless RAG, and Long-Context Inference—against technical documents up to 300 pages. Our results reveal clear operational trade-offs. Long-Context successfully mitigates the “Lost in the Middle” phenomenon, scoring 9.6/10 in accuracy on 142-page documents. However, scaling to 300 pages introduces severe limitations: latencies spike to 21.7s and costs increase by nearly 100x compared to RAG systems. Conversely, Modular RAG maintains low latency (4.2s) and cost ($0.0012), though accuracy drops to 4.1/10 on massive datasets due to retrieval limits. Serverless RAG offers a stable middle ground. Ultimately, while Long-Context is adequate for medium-sized documents despite its higher cost, our findings suggest that beyond a certain size threshold, RAG architectures remain essential to reduce context size, ensuring scalable and cost-effective operations.

Florian Alexandru Serb Petrusel, Pedro Antonio García López (Universitat Rovira i Virgili)


#4 - Evaluating Large Language Models for Automated Serverless Workflow Generation

Large Language Models (LLMs) have shown strong capabilities in code generation, yet their effectiveness in composing multi-step serverless workflows remains underexplored. In this paper, we present a systematic empirical evaluation of four LLMs in generating executable cloud workflows involving multiple Backend-as-a-Service components. We use three representative workflows of increasing complexity and assess model performance across several dimensions, including correctness and iteration effort. Our results show that while LLMs can successfully generate executable workflows, they require iterative refinement and exhibit significant variability depending on model choice. These findings provide insights into the practical limitations and opportunities of LLM-driven automation in cloud-native environments.

Sashko Ristov, Philipp Gritsch, Florian Unterhofer, Ruth Breu (University of Innsbruck)