Local AI Server Guide


Hardware

This section is about hardware needs for running a local LLM. The goals are assumed to be as follows:

  1. Effective dollars per unit of performance is the target
  2. A server or always-on system is desired
  3. Access to the server from inside and outside the network is desired
  4. Electricity is a factor in the costs
  5. You are proficient in hardware and software
  6. The desired function is inference, not training

Hardware Factors in LLM Performance

LLM inference has two distinct phases with different hardware requirements:

1. Prompt Processing (Compute-Bound)

Prompt processing is bound by compute speed: FLOPS (floating point operations per second) or TOPS (tera operations per second, usually quoted for integer math). Since we are dealing with quantized model weights, the most important specs are 8-bit integer throughput (INT8 TOPS) and 16-bit float throughput (FP16 FLOPS). GPUs excel at the parallel processing required, making them ideal for prompt processing.

This is why you should definitely have a GPU in your system for prompt processing, regardless of how much memory bandwidth your system has.

Example
Context window of 8192 tokens:
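As a back-of-the-envelope sketch (real numbers vary a lot by backend and architecture; the 24B parameter count and the TFLOPS figures here are assumptions, not measurements), prompt processing needs roughly 2 FLOPs per parameter per prompt token:

# rough prompt-ingestion time; swap in your own model size, prompt length, and compute
awk -v params_b=24 -v tokens=8192 -v tflops=100 'BEGIN {
  flops = 2 * params_b * 1e9 * tokens             # ~2 FLOPs per parameter per token
  printf "~%.0f seconds to ingest the prompt\n", flops / (tflops * 1e12)
}'

At an assumed 100 FP16 TFLOPS (a mid-range GPU) that is a few seconds; at the ~1 TFLOPS a CPU might manage, the same prompt takes several minutes.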

2. Token Generation (Memory-Bound)

Token generation is bound by memory bandwidth. To generate each token, the entire model must be sent to the processor layer by layer. This means whatever the size of the model, that amount of data must be transferred for every generated token, and this data movement almost always takes longer than the actual processing. GPUs typically have much higher memory bandwidth than CPUs, but consumer motherboards often have limited RAM bandwidth because they only have two memory channels. Adding memory channels increases bandwidth proportionally, and taking advantage of them means putting at least one stick in every channel, so several smaller RAM sticks spread across the channels beat one or two large ones.

Because of this, much older servers or professional workstations often outperform new consumer hardware at token generation because they have more memory channels.
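To see the ceiling this puts on speed, divide memory bandwidth by the amount of data read per token (roughly the model size). The model size and bandwidth figures below are assumptions for illustration (a ~7GB quantized model, dual-channel DDR4 at ~51 GB/s, and a 3060's ~360 GB/s):

# rough upper bound: tokens/sec = memory bandwidth / bytes moved per token (~model size)
awk -v model_gb=7 -v bw_gbs=51.2 'BEGIN { printf "dual-channel DDR4: ~%.1f tok/s ceiling\n", bw_gbs / model_gb }'
awk -v model_gb=7 -v bw_gbs=360  'BEGIN { printf "RTX 3060 12GB:     ~%.1f tok/s ceiling\n", bw_gbs / model_gb }'

Real throughput lands below these ceilings, but the ratio between them is the point.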

Calculating Memory Bandwidth

Formula: bandwidth_gbps = (memory_clock_mhz × 2 [DDR transfers per clock] × 64 [bits per channel] × channels) ÷ 8000 [converts Mbit/s to GB/s]

Example

DDR4-3200 runs at a 1600 MHz memory clock (3200 MT/s).

Therefore, at DDR4-3200 we get (1600 × 2 × 64 × channels) ÷ 8000 = 25.6 GB/s per channel: 51.2 GB/s on a dual-channel consumer board, or 153.6 GB/s with six channels.
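If you want to plug in your own numbers while shopping, here is the same formula as a one-liner (1600 MHz and six channels are just the example values from above):

# bandwidth (GB/s) = MHz × 2 (DDR) × 64 bits per channel × channels ÷ 8000
awk -v mhz=1600 -v channels=6 'BEGIN { printf "%.1f GB/s\n", mhz * 2 * 64 * channels / 8000 }'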

Part 3: Base Knowledge

There are things that will be required or extremely helpful to know going forward. I recommend attempting to commit as much as you can to memory:

Terms and Jargon

Models are the whole of the AI unit. As a person is to humanity, a model is to AI.

Weights are the files the AI is stored on. They come in different forms, but the kinds of weights we will be dealing with are quantized.

Quantization removes precision from model weights and makes them smaller. Using different methods, the numbers in the weights are converted to smaller representations, making them easier to store, easier to fit into memory, and faster. This can be thought of as a type of lossy compression. There are different methods of quantization, but the one we are going to focus on is GGML, specifically GGUF.
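As a toy illustration of the lossy-compression idea (this is plain rounding to a 4-bit grid, not how GGML's K-quants actually work), here is what happens to a few made-up weight values when only 16 levels are available:

# toy 4-bit symmetric quantization: each weight snaps to one of ~16 steps and picks up a small error
awk 'BEGIN {
  split("0.12 -0.80 0.45 0.02", w, " ")
  scale = 0.80 / 7                                     # map the largest weight onto the 4-bit range
  for (i = 1; i <= 4; i++) {
    q = int(w[i] / scale + (w[i] >= 0 ? 0.5 : -0.5))   # round to the nearest step
    printf "%5.2f -> stored as %2d -> decodes to %6.3f\n", w[i], q, q * scale
  }
}'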

GGUFs are a file extension and a type of quantized model weights used by llama.cpp.

llama.cpp is an inference backend, and is the basis for a lot of wrappers like Ollama, llama-box, LM Studio, KoboldCpp, Jan, and many others. It can also be used by itself without a wrapper, though the wrappers usually have some features making it much easier to use.

Image projectors are additional model weights, stored in their own GGUF file, which give a language model vision capability.

Huggingface is a website that offers model weight storage and downloading, using a git system. It is the home of foundational models and finetunes.
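If you just want to grab a single GGUF from the command line, wget works with the 'resolve' form of the file URL (take the browser URL and swap 'blob' for 'resolve'); the example below is the same Gemma quant used later in this guide:

wget https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF/resolve/main/google_gemma-3-12b-it-Q4_K_M.gguf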

Finetunes are models which have been trained again for a specific use case. Many finetunes are an attempt to uncensor foundation models to varying degrees of effectiveness.

Abliteration is the process of programmatically removing a model's capability to refuse requests, rather than removing it through additional training.

MCP is a protocol which allows models to call tools offered to them which can provide information or capabilities they do not have, like web search or file access.

Instruct training is used to turn a base model into an instruct (or 'it') model, one which can follow directions and chat. It requires an instruct template, which is a specific way of wrapping a prompt in tags that the model recognizes; the template tells the model when turns begin and end, which parts of the chat came from the user and which from the model, how to call tools, and things like that. Without instruct training a model will only complete the text you send it, since it does not know how turns work.
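To make that concrete, here is a sketch of one common template format (this is ChatML; Gemma, Mistral, and others use their own tags, and the backend or wrapper normally applies the correct one for you):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant

The model sees all of this as plain text and generates from after the last tag; the backend is what keeps wrapping each turn in these tags.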

OpenAI is not just a company which makes a shitty chat model helmed by a giant psychotic asshole; its API format is also the standard for sending requests to models serving an API. Think of it like USB: there is a standard which lets you plug pretty much anything into a computer, but what happens once the device talks to the computer is not the concern of USB; as long as the plug fits and the computer can talk to the device, the job of the USB layer is finished. The OpenAI standard is the USB of LLMs and lets you abstract away chat history, instruct templates, tool calling, and all that annoying crap and just send messages and get a response back.

Offloading is putting model layers onto the video card to be processed instead of the CPU.

Samplers are settings that determine the probability distribution of the output tokens. They directly affect the quality and type of response that the model is able to give.

Context is the conversation log that fits into the model’s working memory. It contains the prompts, outputs, and user inputs, and any media or documents given to the model to perform the task asked of it. There is a limited amount of tokens that can fit here, along with a large performance penalty as it gets filled. In addition, there must be enough room left over for the model to generate an output.

GGUF Quants

Let’s get into some specifics about GGUFs. They have a specific naming convention which is more or less followed: FineTune-ModelName-ModelVersion-ParameterSize-TrainingType-QuantType_QuantSubType.gguf with the image projector having an mmproj prefix.

The parameter size is usually a number followed by a ‘b’ and indicates how many numbers are in the weights, with the ‘b’ meaning billions. Those parameters are each composed of a number of bits specified by the quant type.

Example:

Mistral-Small-3.1-24B-Instruct-2503-Q6_K.gguf

It is not important to know specifically what the letters after the Q mean, just that they are ways of adding more or less precision to certain parts of the model, and that K_M is generally the one you want if you can fit it.

How do you know if you can fit a model in your VRAM?

A quick rule of thumb is to multiply the parameter size by the quant precision and divide by 8 to get a number in GB. For instance 24b at Q6 would be (24 * 6) / 8 = 18GB. Hey, would you look at that, it is also the file size!

But of course you will need room for the context cache, so you need to add about 20% to that to get the VRAM required to offload it completely onto GPU.
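Here is that rule of thumb as a command you can reuse while browsing quants (the 24B/Q6 inputs are the example from above, and the 1.2 factor is the same rough 20% allowance for context):

awk -v params_b=24 -v bits=6 'BEGIN {
  weights = params_b * bits / 8
  printf "weights ~%.1f GB, plan on ~%.1f GB VRAM with context\n", weights, weights * 1.2
}'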

GGUFs are sometimes composed of multiple files due to huggingface having a 50GB single file limit.

Building a Server

We have established that you need a GPU, but you have to put a GPU into something.

The Ideal Server

The ideal server would be a new top-of-the-line workstation with AMD EPYC or Intel Xeon with maxed-out DDR5 channels and a 2KW power supply, but I don’t know anything about top-of-the-line hardware and only care about getting the cheapest, yet most capable box for the money. I will set an ideal all-encompassing target price of $1000, including the GPU(s), using current (as of July 2025) eBay prices for the completed system. The system will be composed of:

  1. At least one CPU
  2. 128GB or more of system RAM, in such distribution as to max out the channels
  3. Cooling
  4. Power supply capable of handling at least two LLM capable GPUs
  5. Solid state drive for the OS
  6. Solid state or regular drive for the model weights
  7. Wired networking
  8. Case

We could build one ourselves, but with used parts, sourcing everything and fitting it all together is a nightmare and very rarely cost effective. That leaves us with older professional workstations made by Dell, HP, and Lenovo.

Caveats

These things are HEAVY. They are very picky about what hardware is inside them, including and especially RAM. They have proprietary base components like motherboards, coolers, PSUs, cases, and cabling. They go through extensive self tests and have a long boot-up sequence. They are large and ugly. They are generally not used by hobbyists and have really long extensive enterprise warranties so you will find little in the way of community help for them and often will get told to contact manufacturer support for any issue. They have a lot of strange foot-guns in the BIOS options.

BUT…

They are much easier to work on than servers. They are much quieter than servers. They are much more reliable than consumer machines. They generally contain everything needed for a great LLM box. They are generally cheap and available in all sorts of configurations once they pass the enterprise warranty period and get dumped on the secondary market (though this can take quite a long time).

What To Avoid

These are the things I know about; I am sure there are many more. If you have ANY information that should be added here, especially if you had to figure it out yourself, then please reply to this thread with it. There is probably nowhere else someone would be able to find it!

Avoid the Precision 7920/7820/5820/5720 models from 2017.

Avoid any system that:

Build

Let’s build an imaginary server with real parts. I will screenshot eBay listings to illustrate what I look for and for posterity.

Note: This is just an exercise, you will have to decide your own specifications and needs and what is available on the market, and plan accordingly.

Let’s start with the HP G4 Z series.

Here is one to check out:

[screenshot: eBay listing for an HP Z6 G4]

Let’s search for the CPU.

[screenshot: looking up the listing's CPU specs]

SKIP.

Here is another one.

[screenshot: a second HP Z6 G4 listing]

Passed the memory bandwidth check.

[screenshot: CPU memory specs for the second listing]

Let’s check for features. First, I see if I can find a picture with the serial number.

[screenshot: listing photo showing the serial number]

Then I put it into the HP support page.

[screenshot: HP support page lookup for the serial number]

Good for PSU. Good for PCIe slots.

[screenshot: PSU and PCIe slot specifications]

See if we have enough room in the case for GPUs.

[screenshot: case interior showing GPU clearance]

I do a search on the web for “HP z6 g4 rebar above 4g decode” to make sure it can support datacenter GPUs. A few Reddit posts confirm that it should be able to with a BIOS update. If it doesn’t, we can flash the ‘desktop’ mode onto P40s or P100s to set their ReBAR requirement low, but that is a last resort.

Now we need to scope out some memory. I do another web search and find this info. This is really good news. It means with six DIMM slots we get 6 channels!

[screenshot: HP documentation on memory channel configuration]

I see on one of the photos that there are 3 filled slots, so we will take one out and add 4x32GB Registered DIMMs.

[screenshot: eBay listing for 32GB registered DIMMs]
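A quick sanity check on what six channels buys us, assuming these RDIMMs end up running at DDR4-2666 (a 1333 MHz memory clock; check what speed your particular CPU actually supports, this is an assumption rather than something from the listing):

awk -v mhz=1333 -v channels=6 'BEGIN { printf "%.1f GB/s\n", mhz * 2 * 64 * channels / 8000 }'

Roughly 128 GB/s, versus the ~50 GB/s of a dual-channel consumer desktop.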

Now we need a GPU or two. What to do? Might as well go with the old reliable and boring 3060 12GB.

[screenshot: eBay listing for an RTX 3060 12GB]

Let’s grab a cheap P100 for the extra VRAM.

[screenshot: eBay listing for a Tesla P100 16GB]

Time to total her up:

Component                           Price (USD)
HP Z6 G4                            309.98
128GB RAM (4x32GB)                  140.00
3060 12GB                           230.00
P100 16GB                           138.00
NVMe 2TB                            100.00
NVMe->PCIe adapter                  15.00
Power adapter, fan, etc. for P100   50.00
Unforeseen expenses                 75.00
Total                               1,057.98

By a nose!

Building the Server

I will go through and build an AI server out of parts that I have and document the process.

Note: this part of the guide is going to assume that you are able to troubleshoot hardware issues yourself. It will not contain any troubleshooting information. The guide assumes you can handle finding out if hardware is incompatible or broken and to replace it as needed. It is not meant to be comprehensive but merely informative of the process.

All the parts have arrived! Now we get to do the real fun stuff.

First step is to make sure it boots as-is. The first time takes a while because it has to go through its checks and self-training again, and this will happen every time I add or remove a part.

Once that is confirmed, the RAM goes in and then we set the BIOS. Reset to defaults first in the menu and reboot again. The most important settings in here are CSM, which gets turned OFF, and UEFI boot, which goes ON. Second is MMIO or Above 4G Decode, which must be ENABLED. Set the rest of the stuff in here to your liking.

After another reboot proves we didn’t break anything, shut it down, pull out the GPU that came with it, and put the new (used) one in. If you have more than one GPU, pick one to be the primary, install it in the same slot as the one that came with it, hook up any power connectors it needs, and leave all the others out at this point.

About cooling and power for P40/P100 style datacenter cards

If you have a P40 or P100 you will need to attach a fan to it, and there are different ways of doing this. I 3D printed an adapter, attached a 75mm blower to it, and hooked that up to a temperature-sensing fan controller. Another 3D-printed part holds the NTC temp probe on the card. The best place to mount the temp probe is at the front on top: it is metal, it is the hottest spot you can reach on the outside of the card, and it already has screw holes and screws you can use.

On top of that you will need 8-pin PCIe to 8-pin EPS power adapters and a way to power the fan. You will have to find the adapters yourself; they are about $10-$15, and if you have limited PCIe power connectors you will need a splitter for them. Each P40 or P100 in your system needs two 6-pins, an 8-pin split into two 6-pins, or two 8-pins. For the fan, some people just pull the pins from the fan connector and shove them into the 12V and GND of a PCIe or Molex power connector, but I cut apart a SATA power connector and soldered on headers for this one.

OS Install

Another note: this guide is not a comprehensive guide on Linux use. I assume you will understand things like command line and how to solve software problems.

Now we will install Ubuntu Server. Why Ubuntu? Because it works and I am used to it. I couldn’t give two shits about Linux distros, so if you are big on Arch or whatever, use that, but any instructions in the guide are for Ubuntu. Also, I am not going to install a desktop environment, because I don’t need one. Once it is working ssh will be the primary mode of administration and it won’t even have a monitor.

Download the latest point release of Ubuntu Server LTS and put it on a USB stick, using Rufus or something else to make it bootable.

When you install the OS you will need to enable a few things (at minimum the OpenSSH server, since we will be administering the box over SSH); the rest is at your discretion.

Once the OS has installed and booted, make sure to update:

sudo apt update
sudo apt upgrade

At this point we can just SSH from another machine and do things there. Find the IP address of your LLM box—it will be given upon login or by typing

ip addr show

and you can ssh into it. I suggest setting it as a static IP on your router.
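For example (the username and address are placeholders; use your own):

ssh youruser@192.168.1.44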

[screenshot: SSH login to the server]

Now we will get the nvidia ecosystem sorted out. Shut down and put any additional GPUs in your box if you have them. When booted type:

sudo ubuntu-drivers devices
[screenshot: ubuntu-drivers devices output]

Find the recommended driver in the list and install it.

sudo apt install nvidia-driver-575

Reboot.

Next we install the CUDA toolkit. Go to NVIDIA’s CUDA Toolkit download page and find the instructions for your distribution, then type what they tell you into your terminal:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-9
nano ~/.bashrc

Paste this at the end of .bashrc

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Now check that it’s all good:

source ~/.bashrc
nvidia-smi
nvcc -V

And it’s ready to go!

If you want to get an LLM running right now:

wget -O koboldcpp https://github.com/LostRuins/koboldcpp/releases/latest/download/koboldcpp-linux-x64
chmod +x koboldcpp

Note: If you use a P40 or P100 add this to the following commands:

--usecublas rowsplit

If you have 24GB VRAM:

./koboldcpp https://huggingface.co/bartowski/google_gemma-3-27b-it-GGUF/blob/main/google_gemma-3-27b-it-Q4_K_M.gguf --mmproj https://huggingface.co/bartowski/google_gemma-3-27b-it-GGUF/blob/main/mmproj-google_gemma-3-27b-it-f16.gguf --contextsize 8192 --gpulayers 999

If you have 12GB or 16GB VRAM:

./koboldcpp https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF/blob/main/google_gemma-3-12b-it-Q4_K_M.gguf --mmproj https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF/blob/main/mmproj-google_gemma-3-12b-it-f16.gguf --contextsize 8192 --gpulayers 999
[screenshot: KoboldCpp loading the model]

When that loads, go to the server’s IP address with :5001 at the end in a web browser, and talk to your LLM!

[screenshot: chatting with the model in the KoboldCpp web UI]

Koboldcpp will work with any OpenAI compatible client by using the URL + port with /v1 at the end as the endpoint. Example: http://192.168.1.44:5001/v1
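If you want to test the endpoint without a client, a plain curl against the chat completions route works (adjust the address to your server; the model name is mostly a label here since KoboldCpp serves whatever model it has loaded):

curl http://192.168.1.44:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-3",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'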

DO NOT INSTALL OLLAMA. It is a giant pain in the ass to get off of your system. You have been warned.