X10 and APGAS at petascale
Olivier Tardieu, Benjamin Herta, et al.
PPoPP 2014
Large language models are often released as families of models with varying parameter counts and bit widths. To reduce cost, inference services increasingly rely on dynamic model selection, preferring smaller models when possible. GPU vendors are on a journey to enable dynamic GPU slicing, which lets a workload request a fraction of the compute and memory units in a GPU and lets slices be created and destroyed on demand without disrupting existing workloads. The onus is now on Kubernetes: the Device Management Working Group is hard at work exposing these capabilities. While vendor-agnostic slicing APIs do not exist yet, this talk demonstrates that incremental GPU slicing is possible today. We replace the Multi-Instance GPU (MIG) manager, which only permits partitioning GPUs in bulk, with an open-source incremental-slicing controller, requiring no new APIs and no changes to the device plugin. Come learn how to achieve incremental slicing in your GPU clusters.
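To make the slicing model concrete, here is a minimal sketch of how a workload consumes a GPU slice on a MIG-enabled Kubernetes cluster today: the slice appears as an extended resource that a pod requests in its resource limits. The resource name `nvidia.com/mig-1g.5gb` is one of the MIG profiles advertised by the NVIDIA device plugin in its mixed strategy; the pod name and image below are placeholders, and actual profile names depend on the GPU model and cluster configuration.

```yaml
# Hedged sketch: a pod requesting a single MIG slice as an extended resource.
# Assumes a MIG-capable node and the NVIDIA device plugin in "mixed" strategy.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference          # placeholder name
spec:
  containers:
  - name: model-server
    image: example.com/inference:latest   # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1          # one 1g.5gb slice of the GPU
```

The point of the talk is that this consumption side is unchanged: the controller creates and destroys such slices incrementally on demand, rather than repartitioning the whole GPU in bulk.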
Max Bloomfield, Amogh Wasti, et al.
ITherm 2025
Vivek Kumar, Daniel Frampton, et al.
OOPSLA 2012
Nikoleta Iliakopoulou, Jovan Stojkovic, et al.
MICRO 2025