The Samaritan Compendium

Local AI that lives in your infrastructure, not their subpoena logs.

GAML — GPU Accelerated Model Loading — solves one specific bottleneck in self-hosted LLM workflows: the wait. Loading a 70B GGUF model on CPU is a 40+ minute coffee break. GAML cuts that to 5–8 minutes using parallel GPU processing where llama.cpp's loader does sequential CPU work.

The bottleneck

GGUF model loading is dequantization plus tensor staging. The reference implementation does this on CPU, one tensor at a time, because that's the portable path. It works. It's also the longest single step in spinning up local inference on a fresh boot or a model swap.

For someone running a homelab with multiple models — switching between a coding model, a chat model, and a summarizer — that's tens of minutes of dead time per swap. GAML rewrites that step as a GPU pipeline.

How it works

Overlapped GPU pipeline. Triple-buffered async processing — while one tensor is dequantizing on the GPU, the next is transferring from disk, and the one before it is being verified. Three stages running in parallel where llama.cpp runs one.
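To make the overlap concrete, here is a minimal CPU-only sketch of a three-stage pipeline using threads and bounded queues. It is not GAML's code: the function names (run_pipeline, read, dequantize, verify) are hypothetical, and the real implementation would presumably use CUDA streams and pinned buffers rather than Python threads. The bounded queue depth plays the role of the fixed set of buffers.

```python
import queue
import threading


def run_pipeline(tensors, read, dequantize, verify, depth=3):
    """Run read -> dequantize -> verify as three overlapped stages.

    All callables are supplied by the caller (hypothetical stand-ins for
    disk I/O, the GPU kernel, and the check against reference output).
    Error handling is omitted to keep the sketch short.
    """
    done = object()                      # sentinel marking end of stream
    q_read = queue.Queue(maxsize=depth)  # bounded: at most `depth` tensors in flight
    q_deq = queue.Queue(maxsize=depth)
    verified = []

    def reader():
        for name in tensors:
            q_read.put((name, read(name)))
        q_read.put(done)

    def worker():
        while True:
            item = q_read.get()
            if item is done:
                q_deq.put(done)
                return
            name, raw = item
            q_deq.put((name, dequantize(raw)))

    def checker():
        while True:
            item = q_deq.get()
            if item is done:
                return
            name, values = item
            verify(name, values)
            verified.append(name)

    threads = [threading.Thread(target=f) for f in (reader, worker, checker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return verified
```

While the checker handles tensor N, the worker can already be on N+1 and the reader on N+2, so total time approaches that of the slowest stage rather than the sum of all three.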

Q4_K dequantization kernel. Hand-written CUDA kernel for the most common quantization format. Q4_K is what most people are running because of the quality-to-size tradeoff; optimizing this one format moves the needle for most users.
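For readers who have not looked inside a 4-bit format, the sketch below shows the basic idea in plain Python. It is a deliberately simplified layout (one float32 scale and one float32 minimum per 32-value block), not the real Q4_K format, which groups 256 values into super-blocks with packed 6-bit sub-block scales. The names quantize_block and dequantize_block are illustrative only.

```python
import struct

BLOCK = 32  # values per block in this simplified layout (an assumption)


def quantize_block(values):
    """Pack 32 floats as: float32 scale, float32 minimum, 16 bytes of 4-bit codes."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # 15 = largest 4-bit code; avoid a zero scale
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    # Byte i holds value i in its low nibble and value i+16 in its high nibble.
    packed = bytes(codes[i] | (codes[i + 16] << 4) for i in range(16))
    return struct.pack("<ff", scale, lo) + packed


def dequantize_block(block):
    """Expand one 24-byte block back into 32 floats."""
    scale, lo = struct.unpack_from("<ff", block, 0)
    nibbles = block[8:24]
    low = [scale * (b & 0x0F) + lo for b in nibbles]   # values 0..15
    high = [scale * (b >> 4) + lo for b in nibbles]    # values 16..31
    return low + high
```

Each output value depends only on its own 4-bit code and the block header, so every value can be computed independently. That independence is what makes the step a natural fit for one-thread-per-value GPU kernels.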

Bit-perfect verification. Every dequantized tensor is verified against the reference output. No silent precision loss. If GAML's output diverges from llama.cpp's, it fails loudly.
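"Bit-perfect" is a stricter test than numeric equality. The sketch below, which assumes float32 output and uses a hypothetical verify_bit_exact helper, compares raw byte patterns and raises on the first mismatch. How GAML actually performs the comparison is not described here, so treat this as an illustration of the idea only.

```python
import struct


def float32_bits(values):
    """Return the little-endian float32 byte pattern of a sequence of floats."""
    return struct.pack(f"<{len(values)}f", *values)


def verify_bit_exact(name, candidate, reference):
    """Raise ValueError unless both sequences have identical float32 bits."""
    if float32_bits(candidate) == float32_bits(reference):
        return
    for i, (c, r) in enumerate(zip(candidate, reference)):
        if struct.pack("<f", c) != struct.pack("<f", r):
            raise ValueError(f"{name}: first mismatch at index {i}: {c!r} != {r!r}")
    raise ValueError(f"{name}: length mismatch")
```

Comparing bytes instead of using == catches cases that numeric comparison hides: 0.0 and -0.0 compare equal as numbers but have different bit patterns, so a bitwise check flags them.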

Context-aware memory planning. GAML knows the context window you're targeting and plans GPU memory accordingly. No swap thrash mid-load.
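The reason context length matters for memory planning is the KV cache, which grows linearly with the context window. The sketch below uses the standard transformer KV cache sizing formula; the plan_fits helper, the reserve size, and the example model shape are assumptions for illustration, not GAML's actual planner.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache: one key and one value tensor per layer."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_value


def plan_fits(vram_bytes, weight_bytes, n_layers, n_ctx, n_kv_heads, head_dim,
              reserve_bytes=512 * 1024**2):
    """Return (fits, bytes_needed) for weights + KV cache + a safety reserve.

    reserve_bytes is a hypothetical allowance for scratch buffers.
    """
    needed = (weight_bytes
              + kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim)
              + reserve_bytes)
    return needed <= vram_bytes, needed


# Illustrative shape: 32 layers, 8 KV heads of dimension 128, fp16 cache.
GIB = 1024**3
print(kv_cache_bytes(32, 2048, 8, 128) / 1024**2)   # 256.0 (MiB at 2048 tokens)
print(plan_fits(8 * GIB, 4 * GIB, 32, 2048, 8, 128)[0])
```

Doubling the context doubles the cache, so a model that fits at 2048 tokens may not fit at 8192. Knowing the target context before loading lets a planner decide this up front instead of failing partway through.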

What it doesn't do

GAML does not do inference. This is important. GAML accelerates the loading step. Once the model is loaded into the optimized format, you hand it to llama.cpp, Ollama, or whatever inference engine you already use. GAML is a preprocessor, not a replacement.

The integration is three steps:

1. Use GAML to load the GGUF
2. Hand the optimized model to your inference engine
3. Serve inference locally

You still need llama.cpp (or equivalent) for the actual generation. GAML just gets you there faster.

Requirements

NVIDIA GPU, compute capability 6.1+
4 GB VRAM minimum (more VRAM allows larger models)
CUDA-compatible host

Check your GPU before loading anything (this needs the gaml binary from the Install step):

./gaml --gpu-info

Install

git clone https://github.com/Fimeg/GAML.git
cd GAML
./docker-build.sh

Usage

Load a model with a 2048-token context:

./gaml --ctx 2048 model.gguf

Benchmark on your hardware:

./gaml --benchmark

Where it fits

GAML matters most when you're switching models frequently or running on hardware where load times dominate the workflow. If you load a model once and run inference for hours, the speedup is nice but not transformative. If you're rotating models throughout the day, it changes what's practical.
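A quick back-of-the-envelope calculation shows where the line falls. The load times are the figures quoted above (40 minutes on CPU, 8 minutes as the upper end of the GAML range); the swap count and session length are made-up examples.

```python
def daily_load_minutes(swaps_per_day, minutes_per_load):
    """Total time spent waiting on model loads in a day."""
    return swaps_per_day * minutes_per_load


def load_share(load_minutes, inference_minutes):
    """Fraction of a session spent loading rather than generating."""
    return load_minutes / (load_minutes + inference_minutes)


# Rotating models: six swaps a day (hypothetical).
print(daily_load_minutes(6, 40))  # 240 minutes waiting with CPU loading
print(daily_load_minutes(6, 8))   # 48 minutes with an 8-minute load

# Load once, then run for four hours (hypothetical).
print(round(load_share(40, 240), 3))  # 0.143
print(round(load_share(8, 240), 3))   # 0.032
```

In the rotating case the difference is over three hours a day. In the load-once case, loading was already a small slice of the session, which is why the speedup there is welcome but not decisive.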

The broader point: the local-AI stack is fast enough now that the friction is in the seams, not the inference. GAML is one seam. Closing seams is how self-hosted AI gets to the point where it stops feeling like a compromise.

Links: GitHub · Releases