SplitLLM: Collaborative Inference of LLMs for Model Placement and
Throughput Optimization
- Akrit Mudvari
- Yuang Jiang
- Leandros Tassiulas
Large language models (LLMs) have been a disruptive innovation in recent
years, and they play a crucial role in our daily lives due to their ability to
understand and generate human-like text. Their capabilities include natural
language understanding, information retrieval and search, translation,
chatbots, and virtual assistance, among others. However, it is well known that
LLMs are massive in terms of the number of parameters. Additionally, the
self-attention mechanism in the Transformer architecture underlying LLMs has
quadratic computation and memory complexity with respect to the input sequence
length. For these reasons, LLM inference is resource-intensive, and its
throughput is limited, especially for longer sequences. In this report, we
design a collaborative inference architecture between a server and its clients
to alleviate this throughput limitation. In this design, we consider the
available resources on both
sides, i.e., the computation and communication costs. We develop a dynamic
programming-based algorithm to optimally allocate computation between the
server and the client device to increase the server throughput, while not
violating the service level agreement (SLA). Our experiments show that the
proposed method distributes the workload efficiently, reducing the server
workload by roughly one third and achieving a 19% improvement over a greedy
method. As a result, we demonstrate that the server throughput is improved in
an environment with a mix of LLM inference request types.
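
To make the split-inference idea concrete, the following is a minimal sketch, not the paper's actual algorithm: it enumerates SLA-feasible split layers for each request under a hypothetical per-layer cost and communication model, then uses a multiple-choice-knapsack dynamic program over a discretized server compute budget to admit as many requests as possible. All cost values, the latency model, and the function names below are illustrative assumptions.

```python
# Illustrative sketch only: split-point search plus a knapsack-style DP.
# The per-layer costs, communication model, and SLA check are hypothetical
# placeholders, not the formulation used in the report.

import math
from typing import List, Tuple


def feasible_splits(server_layer_cost: List[float],
                    client_layer_cost: List[float],
                    comm_cost: List[float],
                    sla_latency: float) -> List[Tuple[int, float]]:
    """Return (split, server_cost) pairs whose end-to-end latency meets the SLA.

    A split s means layers [0, s) run on the client and layers [s, L) on the
    server; comm_cost[s] models shipping the activations at that cut point.
    """
    num_layers = len(server_layer_cost)
    options = []
    for s in range(num_layers + 1):
        client_time = sum(client_layer_cost[:s])
        server_time = sum(server_layer_cost[s:])
        latency = client_time + comm_cost[s] + server_time
        if latency <= sla_latency:
            options.append((s, server_time))
    return options


def max_admitted_requests(request_options: List[List[Tuple[int, float]]],
                          server_capacity: float,
                          grid: int = 1000) -> int:
    """Multiple-choice knapsack DP over a discretized server compute budget.

    For each request we either reject it or pick one SLA-feasible split;
    dp[b] is the maximum number of admitted requests using at most b budget
    steps of server compute.
    """
    step = server_capacity / grid
    dp = [0] * (grid + 1)
    for options in request_options:
        new_dp = dp[:]                      # rejecting the request keeps dp
        for _, server_cost in options:
            c = math.ceil(server_cost / step)
            for b in range(c, grid + 1):
                new_dp[b] = max(new_dp[b], dp[b - c] + 1)
        dp = new_dp
    return dp[grid]


if __name__ == "__main__":
    # Toy 4-layer model with made-up per-layer costs (assumptions).
    server_cost = [1.0, 1.0, 1.0, 1.0]
    client_cost = [3.0, 3.0, 3.0, 3.0]
    comm = [0.5] * 5                        # one entry per possible cut point
    requests = [feasible_splits(server_cost, client_cost, comm, sla_latency=8.0)
                for _ in range(3)]
    print(max_admitted_requests(requests, server_capacity=7.0))
```

In this toy setting each request can feasibly offload at most one layer to the client, so with a server budget of 7.0 only two of the three requests fit; the DP reports 2. The actual algorithm in the report may differ in its objective, cost model, and state space.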