Hosting a large language model (LLM) is a complex and challenging task. The first challenge is sheer model size, which demands significant computational resources and storage capacity. Another is model sharding, which involves splitting the model across multiple servers to distribute the computational load. Model serving and inference workflows must also be carefully designed and optimized to handle a high volume of requests and data. Doing all of this requires technical expertise in distributed computing, data management, and machine learning, and the infrastructure setup itself can be complex, demanding significant investment in hardware and software.
Beyond the infrastructure, there are additional points to consider around cost. Overall, the cost of hosting a large language model can be significant and requires careful planning and budgeting. However, the benefits of using these models for natural language processing tasks can outweigh the costs in many cases.
When it comes to the performance of a large language model, there are also a few key terms you need to think about.
Now let's take a quick look at the memory requirements for loading a GPT-J model. The memory requirements depend on whether you are training or serving the model. Let's do some quick math for training GPT-J.
GPT-J has 6 billion parameters. For FP32 (4 bytes per parameter), you need 24GB just to load the parameters, and the same again for the gradients. Training typically uses the Adam optimizer, which keeps squared gradients occupying another 24GB, and storing the remaining optimizer state (momentum) requires 24GB more. So far that is 96GB just to hold a single training instance of GPT-J. On top of this, you also need to fit the training batch and the activation memory footprint, which can easily push the requirement past 200GB. The memory requirements are roughly halved if you use an FP16 model.
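Here is a back-of-the-envelope sketch of that arithmetic in Python; the 6-billion-parameter count for GPT-J and the 4-bytes-per-value FP32 assumption are as stated above:

params = 6e9           # GPT-J parameter count
bytes_per_value = 4    # FP32 uses 4 bytes per value

weights = params * bytes_per_value            # model weights: 24 GB
gradients = params * bytes_per_value          # one gradient per weight: 24 GB
adam_momentum = params * bytes_per_value      # Adam first moment: 24 GB
adam_sq_gradients = params * bytes_per_value  # Adam squared gradients: 24 GB

total = weights + gradients + adam_momentum + adam_sq_gradients
print(f"Static training memory: {total / 1e9:.0f} GB")  # 96 GB, before batch and activations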
How many GPUs do I need to serve Llama 70B? To answer that, we first need to determine the amount of GPU memory required by the Large Language Model (LLM). This calculation can be done using a straightforward formula:

M = (P × 4B) / (32 / Q) × 1.2
Symbol Description:
M: GPU memory expressed in gigabytes
P: The number of parameters in the model. For instance, a 7B model has 7 billion parameters.
4B: 4 bytes, the number of bytes used for each parameter
32: There are 32 bits in 4 bytes
Q: The number of bits used for loading the model, for example 16 bits, 8 bits, or 4 bits
1.2: Represents a 20% overhead for loading additional elements in GPU memory
Now, let’s illustrate with some examples.
GPU Memory Required for Serving Llama 70B
Let’s calculate the GPU memory required for serving Llama 70B, loading it in 16 bits. The model has 70 billion parameters.
M = (70 × 4) / (32 / 16) × 1.2 = 168 GB
That's quite a lot of memory. A single A100 80GB wouldn't be enough, but 2x A100 80GB should be enough to serve the Llama 2 70B model in 16-bit mode.
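As a sketch, the formula can be wrapped in a small Python helper; the function name and parameters are mine, not part of any library:

def serving_memory_gb(params_billions, q_bits, overhead=1.2):
    """GPU memory in GB to serve a model: M = (P * 4 bytes) / (32 / Q) * 1.2"""
    return params_billions * 4 / (32 / q_bits) * overhead

print(serving_memory_gb(70, 16))  # Llama 70B in 16-bit -> 168.0 GB
print(serving_memory_gb(70, 4))   # the same model in 4-bit -> 42.0 GB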
Now that we have talked about memory consumption, let's look at how we can reduce it with model compression techniques.
Below is Python code to estimate the size of a model, using the accelerate library's memory estimation utilities:
import torch
from accelerate.utils import calculate_maximum_sizes, convert_bytes
from accelerate.commands.estimate import create_empty_model

# How much smaller each dtype is relative to float32
DTYPE_MODIFIER = {"float32": 1, "float16/bfloat16": 2, "int8": 4, "int4": 8}

def calculate_memory(model: torch.nn.Module, options: list):
    "Calculates the memory usage for a model initialized on the `meta` device"
    total_size, largest_layer = calculate_maximum_sizes(model)
    data = []
    for dtype in options:
        dtype_total_size = total_size
        dtype_largest_layer = largest_layer[0]
        # Scale float32 sizes down by the dtype's compression factor
        modifier = DTYPE_MODIFIER[dtype]
        dtype_total_size /= modifier
        dtype_largest_layer /= modifier
        # Training with Adam needs roughly 4x the model size
        # (weights + gradients + two optimizer states)
        dtype_training_size = convert_bytes(dtype_total_size * 4)
        dtype_total_size = convert_bytes(dtype_total_size)
        dtype_largest_layer = convert_bytes(dtype_largest_layer)
        data.append(
            {
                "dtype": dtype,
                "Largest Layer or Residual Group": dtype_largest_layer,
                "Total Size": dtype_total_size,
                "Training using Adam": dtype_training_size,
            }
        )
    return data

# Instantiate an empty (meta-device) copy of the model, so no weights are downloaded
model_name = 'microsoft/phi-2'
model = create_empty_model(model_name, library_name=None, trust_remote_code=True, access_token=None)

results = calculate_memory(model, ["float32"])
for result in results:
    print(f"Total size of the Model with dtype {result['dtype']} is {result['Total Size']}")
Pruning refers to removing redundant or less important parameters from a neural network model to reduce its size and computational requirements. This is done by systematically setting low-value weight parameters to zero. Structured pruning removes entire neurons/filters, while unstructured pruning zeros out individual weights. Pruning can reduce model size by over 90% with minimal loss in accuracy.
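As an illustrative sketch, unstructured magnitude pruning can be applied to a single layer with PyTorch's built-in pruning utilities; the layer shape and the 90% pruning amount below are just example settings:

import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(4096, 4096)  # stand-in for one layer of a larger model
prune.l1_unstructured(layer, name="weight", amount=0.9)  # zero out the 90% smallest-magnitude weights
sparsity = (layer.weight == 0).float().mean()
print(f"Weight sparsity: {sparsity:.0%}")  # ~90% of weights are now zero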
Knowledge distillation trains a smaller “student” model to mimic the outputs of a larger “teacher” model. The student is trained on soft targets (output distributions) from the teacher, capturing dark knowledge beyond just hard labels. This allows the student to learn complex functions learned by the teacher efficiently. Distillation can reduce compute by over 90% with minimal loss in accuracy.
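A minimal sketch of the distillation loss described above, assuming student_logits and teacher_logits come from forward passes of the two models on the same batch (the temperature and tensor shapes are illustrative):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both output distributions with a temperature, then match them via KL divergence
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

teacher_logits = torch.randn(8, 32000)  # batch of 8, vocabulary of 32k
student_logits = torch.randn(8, 32000)
print(distillation_loss(student_logits, teacher_logits))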
Quantization reduces the precision of weights and activations from float32 to lower bit widths like int8 or int4. This shrinks model size and speeds up computation on integer-optimized hardware. Quantization applies techniques like clipping, rounding, and rescaling to discretize the continuous values while retaining model accuracy. Typical approaches are post-training quantization, quantization-aware training, and quantization-aware finetuning.

In summary, pruning, distillation, and quantization are three key techniques for optimizing large AI models by reducing redundancy, transferring knowledge, and lowering precision, respectively. Used together, they can provide massive reductions in model size and compute requirements with minimal impact on accuracy.
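To make the quantization idea concrete, here is a minimal post-training example using PyTorch's dynamic quantization, which converts a toy model's Linear layers to int8 in one call:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)
# Replace Linear layers with int8 versions; weights now use 1/4 the memory of float32
quantized = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
print(quantized)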
Tensor parallelism is a technique used to distribute the computation of large neural network layers across multiple devices. The key idea is to partition each layer's large tensors, such as its weight matrices, into smaller shards and compute each shard in parallel on a different device.
A key point on tensor parallelism is that each device holds only a shard of each partitioned tensor, so the devices must communicate (for example, via all-reduce or all-gather collectives) to assemble the full layer output.
Pipeline parallelism is a technique for distributed training of large neural network models across multiple devices or nodes. In pipeline parallelism, the model is split into partitions or stages, with each stage assigned to a different device. The input is fed through the pipeline in micro-batches, with each device performing computations on the micro-batch and then passing its outputs to the next device. This allows for parallelization across devices and overlap of computation and communication. In contrast, tensor parallelism splits the model across devices by partitioning tensors, typically along the hidden dimension. For example, different slices of a large weight matrix may be assigned to different devices. The devices collectively compute the result for a layer, synchronizing gradients at each step.
The key difference between pipeline and tensor parallelism: pipeline parallelism splits the model by layers into sequential stages and passes micro-batch activations between stages, while tensor parallelism splits individual layers across devices and requires collective communication within every layer.
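This single-process sketch column-splits a weight matrix across two simulated "devices" and shows that concatenating the partial outputs reproduces the full layer; real implementations such as Megatron-LM do this across GPUs with collective communication:

import torch

x = torch.randn(2, 8)   # input batch
W = torch.randn(8, 16)  # full weight matrix of one linear layer

# Column parallelism: each "device" holds half of the output columns
W0, W1 = W[:, :8], W[:, 8:]
y0 = x @ W0                      # computed on device 0
y1 = x @ W1                      # computed on device 1
y = torch.cat([y0, y1], dim=-1)  # the all-gather that combines partial results

print(torch.allclose(y, x @ W))  # True: the sharded computation matches the full layer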
Overall, generative inference of LLMs has three main challenges, according to Pope et al. (2022).
You can also look at request batching to increase the number of requests served per forward pass and improve GPU utilization.
The industry recognized the inefficiency and came up with a better approach. Orca: A Distributed Serving System for Transformer-Based Generative Models, a paper presented at OSDI '22, is the first to our knowledge to tackle this problem. Instead of waiting until every sequence in a batch has completed generation, Orca implements iteration-level scheduling where the batch size is determined per iteration. The result is that once a sequence in a batch has completed generation, a new sequence can be inserted in its place, yielding higher GPU utilization than static batching.
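A highly simplified sketch of iteration-level scheduling; the queue, batch size, and finish check below are hypothetical stand-ins, not Orca's actual design or API:

import random

MAX_BATCH = 4
waiting = [f"req{i}" for i in range(1, 9)]  # queued requests
running = []

def decode_one_step(batch):
    pass  # stand-in for one forward pass that emits one token per running sequence

def is_finished(req):
    return random.random() < 0.3  # stand-in for an EOS / max-length check

while waiting or running:
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.pop(0))  # a new sequence immediately fills a freed slot
    decode_one_step(running)            # one iteration for the current batch
    running = [r for r in running if not is_finished(r)]  # evict completed sequences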
In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate the next tokens. These cached key and value tensors are often referred to as the KV cache. The KV cache is both large and dynamic: its size grows with sequence length, which is highly variable and unpredictable, making it hard to manage efficiently.
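To get a feel for the scale, here is a rough sketch of the KV cache size; the layer count and hidden size are the published LLaMA-13B architecture values, and FP16 storage is assumed:

# KV cache per token = 2 (K and V) * n_layers * hidden_size * bytes_per_value
n_layers, hidden_size = 40, 5120  # LLaMA-13B architecture
bytes_per_value = 2               # FP16
per_token = 2 * n_layers * hidden_size * bytes_per_value
seq_len = 2048
print(f"{per_token / 1e6:.1f} MB per token, "
      f"{per_token * seq_len / 1e9:.1f} GB for a {seq_len}-token sequence")  # ~0.8 MB, ~1.7 GB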
PagedAttention is a new attention mechanism implemented in vLLM (GitHub). It takes inspiration from traditional OS concepts such as paging and virtual memory, and allows the KV cache (computed in the "prefill" phase) to be non-contiguous by allocating memory in fixed-size "pages", or blocks. The attention mechanism can then be rewritten to operate on block-aligned inputs, allowing attention to be performed on non-contiguous memory ranges.
This means that buffer allocation can happen just-in-time instead of ahead-of-time: when starting a new generation, the framework does not need to allocate a contiguous buffer of size maximum_context_length. Each iteration, the scheduler can decide if it needs more room for a particular generation, and allocate on the fly without any degradation to PagedAttention’s performance. This doesn’t guarantee perfect utilization of memory (their blog says the wastage is now limited to under 4%, only in the last block), but it significantly improves upon wastage from ahead-of-time allocation schemes used widely by the industry today.
Altogether, PagedAttention + vLLM enable massive memory savings as most sequences will not consume the entire context window. These memory savings translate directly into a higher batch size, which means higher throughput and cheaper serving.
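A toy sketch of the paging idea: a block table maps a sequence's logical KV blocks to whatever physical blocks are free, so the cache never needs a contiguous maximum-length buffer (the block size and pool size here are arbitrary, not vLLM's actual values):

BLOCK_SIZE = 16                # tokens per KV block
free_blocks = list(range(64))  # pool of physical blocks in GPU memory
block_tables = {}              # sequence id -> list of physical block ids

def append_token(seq_id, num_tokens_so_far):
    # Allocate a new physical block just-in-time, only when the last block is full
    if num_tokens_so_far % BLOCK_SIZE == 0:
        block_tables.setdefault(seq_id, []).append(free_blocks.pop(0))

for t in range(40):  # generate 40 tokens for one sequence
    append_token("seq0", t)

print(block_tables["seq0"])  # [0, 1, 2]: only 3 blocks allocated for 40 tokens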
Dynamic SplitFuse is a novel token composition strategy for prompt processing and token generation. DeepSpeed-FastGen utilizes Dynamic SplitFuse to run at a consistent forward size by leveraging the capability to take partial tokens from prompts and compose them with generation. In particular, Dynamic SplitFuse performs two key behaviors:
1. Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations), with only the final pass performing any generation.
2. Short prompts are composed to exactly fill a target token budget; even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well aligned.
Together, these two techniques provide concrete benefits on all user metrics: better responsiveness, higher efficiency, and lower variance with more consistent generation latency.
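This sketch splits long prompts into chunks and packs short prompts together so every forward pass sees roughly the same token budget; the budget and prompt lengths are illustrative, not DeepSpeed's actual scheduler:

TOKEN_BUDGET = 8                       # target tokens per forward pass
remaining = {"a": 13, "b": 3, "c": 2}  # prompt lengths in tokens

step = 0
while any(remaining.values()):
    budget = TOKEN_BUDGET
    batch = []
    for seq, left in remaining.items():
        take = min(left, budget)       # take a partial chunk of a long prompt if needed
        if take:
            batch.append((seq, take))
            remaining[seq] -= take
            budget -= take
    step += 1
    print(f"pass {step}: {batch}")     # each pass processes up to TOKEN_BUDGET tokens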
We hope that knowing these fundamental concepts helps you in deploying and training large language models with Azure Machine Learning.