Hardware
This section is about hardware needs for running a local LLM. The goals are assumed to be as follows:
- Cost efficiency (dollars per unit of performance) is the target
- A server or always-on system is desired
- Access to the server from inside and outside the network is desired
- Electricity is a factor in the costs
- You are proficient in hardware and software
- The desired function is inference, not training
Hardware Factors in LLM Performance
LLM inference has two distinct phases with different hardware requirements:
1. Prompt Processing (Compute-Bound)
Prompt processing is bound by compute speed, measured in FLOPS (floating-point operations per second) or TOPS (trillions of integer operations per second). Since we are dealing with quantized model weights, the most important specs are 8-bit integer throughput (INT8 TOPS) and 16-bit float throughput (FP16 FLOPS). GPUs excel at the parallel processing required, making them ideal for prompt processing.
- Bottleneck is raw compute power (FLOPS/TOPS)
- GPUs excel due to parallel processing; CPUs lag far behind
- Look for INT8 TOPS, FP16 FLOPS
This is why your system should include a GPU for prompt processing, no matter how much memory bandwidth it has.
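To get a feel for why compute matters here, a common rule of thumb (an approximation, not an exact figure for any specific model) is that processing one token costs roughly 2 FLOPs per model parameter. A minimal sketch, with all model and hardware numbers hypothetical:

```python
def prompt_processing_seconds(prompt_tokens, model_params, compute_tops):
    """Rough estimate of seconds to process a prompt on a compute-bound device.

    Assumes ~2 operations per parameter per token (rule of thumb).
    compute_tops: effective throughput in trillions of operations per second.
    """
    ops_needed = 2 * model_params * prompt_tokens
    return ops_needed / (compute_tops * 1e12)

# Hypothetical numbers: a 7B-parameter model, a 4700-token prompt,
# on a GPU delivering an effective 100 TOPS.
print(round(prompt_processing_seconds(4700, 7e9, 100), 3))
```

The same prompt on a CPU delivering a few TOPS would take tens of seconds by this estimate, which is the gap the bullet points above describe.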
Example
Context window of 8192 tokens:
- Prompt: 200 tokens
- Chat history: 3000 tokens
- Document: 1000 tokens
- Image: 500 tokens
- Total used: 4700 tokens for processing (if not cached)
- Remaining: 3492 tokens available for generation
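The context-window arithmetic above can be sketched directly:

```python
# Token budget for the example context window above.
CONTEXT_WINDOW = 8192

used = {
    "prompt": 200,
    "chat_history": 3000,
    "document": 1000,
    "image": 500,
}

total_used = sum(used.values())          # 4700 tokens to process (if not cached)
remaining = CONTEXT_WINDOW - total_used  # 3492 tokens left for generation
print(total_used, remaining)
```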
2. Token Generation (Memory-Bound)
Token generation is bound by memory bandwidth. To generate each token, the model's weights must be read from memory layer by layer, so whatever the size of the model, that amount of data must be transferred for every generated token. This data movement almost always takes longer than the actual processing. GPUs typically have higher memory bandwidth than CPUs, but consumer motherboards often have limited RAM bandwidth because they provide only two memory channels. Adding memory channels increases bandwidth proportionally and is achievable by populating multiple smaller RAM sticks (two per channel).
- Bottleneck is memory bandwidth
- More memory channels > faster memory speed
- Dual CPUs double memory channels
Because of this, much older servers or professional workstations often outperform new consumer hardware at token generation because they have more memory channels.
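Since every generated token requires streaming the full model weights from memory, a useful ceiling on generation speed is simply bandwidth divided by model size. A sketch, with the model size hypothetical:

```python
def max_tokens_per_second(bandwidth_gbs, model_size_gb):
    """Upper bound on generation speed: memory bandwidth / model size.

    Real throughput is lower (KV-cache reads, overhead), so treat this
    as a ceiling, not a promise.
    """
    return bandwidth_gbs / model_size_gb

# Hypothetical: a 7B model quantized to ~4 GB in memory, on
# dual-channel DDR4-3200 (51.2 GB/s) versus an eight-channel
# dual-CPU server (204.8 GB/s).
print(round(max_tokens_per_second(51.2, 4.0), 1))   # consumer desktop
print(round(max_tokens_per_second(204.8, 4.0), 1))  # server
```

This is why the older server with four times the channels can generate roughly four times as many tokens per second, despite a slower CPU.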
Calculating Memory Bandwidth
Formula: bandwidth_GBps = (clock_MHz × 2 [DDR transfers per clock] × 64 [bits per channel] × channels) ÷ 8000 [converts megabits to gigabytes]
Example
DDR4-3200:
- Base clock is 1600 MHz, or 1600 million cycles per second
- DDR transfers twice per clock, for 3200 million transfers per second
- Multiply by 64 bits per transfer for 204,800 million bits per second
- Divide by 8000 (8 bits per byte × 1000 mega per giga) to get 25.6 billion bytes per second (25.6 GB/s)
Therefore, at DDR4-3200 we get:
- Single channel: 25.6 GB/s
- Dual channel: 51.2 GB/s
- Quad channel: 102.4 GB/s
- Dual CPUs, each quad channel (eight channels total): 204.8 GB/s
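The formula above is easy to turn into a helper for comparing configurations:

```python
def bandwidth_gb_per_s(clock_mhz, channels, ddr=True):
    """Theoretical peak memory bandwidth in GB/s.

    Dividing by 8000 converts megabits to gigabytes in one step
    (8 bits per byte x 1000 mega per giga).
    """
    transfers_per_s = clock_mhz * (2 if ddr else 1)   # MT/s
    megabits_per_s = transfers_per_s * 64 * channels  # 64 bits per channel
    return megabits_per_s / 8000

# DDR4-3200 runs a 1600 MHz base clock:
print(bandwidth_gb_per_s(1600, 1))  # 25.6  (single channel)
print(bandwidth_gb_per_s(1600, 2))  # 51.2  (dual channel)
print(bandwidth_gb_per_s(1600, 8))  # 204.8 (dual CPUs, quad channel each)
```

These are theoretical peaks; real-world throughput is somewhat lower, and on dual-CPU boards the full aggregate is only reachable when work is spread across both sockets.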