Analog AI as a Service: A Cloud Platform for In-Memory Computing
Kaoutar El Maghraoui, Kim Tran, et al.
SSE 2024
Large Language Models (LLMs), with their remarkable generative capabilities, have significantly impacted various fields, yet they face challenges due to their immense parameter counts and the resulting high costs of training and inference. The trend of increasing model sizes exacerbates these challenges, particularly in terms of memory footprint, latency, and energy consumption. Traditional hardware such as GPUs, while powerful, is not optimally efficient for LLM inference, leading to a growing dependence on cloud services. In this paper, we explore the deployment of Mixture of Experts (MoE)-based models on 3D Non-Volatile Memory (NVM)-based Analog In-Memory Computing (AIMC) hardware. This novel hardware paradigm, which utilizes stacked NVM devices arranged in crossbar arrays, offers, when combined with the MoE network architecture, an innovative solution to the parameter-fetching bottleneck typical of models deployed on conventional von Neumann architectures. By simulating the deployment of both dense and MoE-based LLMs on an abstract 3D NVM-based AIMC system, we demonstrate that MoE-based models, thanks to their conditional-compute paradigm, are better suited to this hardware: they scale more favorably and maintain high performance even in the presence of the noise typical of analog computation. Our findings suggest that MoE-based models, in conjunction with emerging 3D NVM-based AIMC hardware, can significantly reduce the inference costs of state-of-the-art LLMs, making them more accessible and energy-efficient.
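The comparison described in the abstract can be illustrated with a minimal, self-contained sketch, not the paper's simulator: it emulates AIMC inference by applying multiplicative Gaussian noise to the weights of a feed-forward layer on every forward pass, and contrasts a dense layer with a toy MoE layer that routes each token to a single expert, so only the selected expert's (noisy) weights contribute to that token's output. The noise model, layer sizes, number of experts, and top-1 routing below are illustrative assumptions, not the configuration used in the paper.

```python
# Illustrative sketch only: emulates analog in-memory computing (AIMC) noise by
# perturbing layer weights with multiplicative Gaussian noise at inference time.
# The noise model, dimensions, and top-1 routing are assumptions for demonstration,
# not the abstract 3D NVM-based AIMC simulator used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def noisy(weight: torch.Tensor, noise_std: float) -> torch.Tensor:
    """Return a copy of `weight` with multiplicative Gaussian noise applied."""
    return weight * (1.0 + noise_std * torch.randn_like(weight))


class NoisyDense(nn.Module):
    """Dense feed-forward block whose weights are perturbed on every forward pass."""

    def __init__(self, d_model: int, d_ff: int, noise_std: float):
        super().__init__()
        self.w_in = nn.Parameter(torch.randn(d_ff, d_model) / d_model**0.5)
        self.w_out = nn.Parameter(torch.randn(d_model, d_ff) / d_ff**0.5)
        self.noise_std = noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.relu(F.linear(x, noisy(self.w_in, self.noise_std)))
        return F.linear(h, noisy(self.w_out, self.noise_std))


class NoisyTop1MoE(nn.Module):
    """Toy MoE block: a router picks one expert per token; only that expert runs."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, noise_std: float):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [NoisyDense(d_model, d_ff, noise_std) for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Top-1 routing means each token activates only one
        # expert, i.e. only a fraction of the (noisy, in-memory) weights per token.
        expert_idx = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out


if __name__ == "__main__":
    torch.manual_seed(0)
    tokens = torch.randn(16, 64)  # 16 tokens, hypothetical d_model = 64
    dense = NoisyDense(64, 256, noise_std=0.05)
    moe = NoisyTop1MoE(64, 256, n_experts=4, noise_std=0.05)
    print("dense output:", dense(tokens).shape)  # torch.Size([16, 64])
    print("moe output:  ", moe(tokens).shape)    # torch.Size([16, 64])
```

Sweeping noise_std and measuring the output error of each block against a noise-free reference gives a rough, toy-scale analogue of the robustness comparison reported in the abstract; the actual study uses full dense and MoE-based LLMs on a simulated 3D NVM-based AIMC system.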