llm.c - LLM Inference Engine
llm.c is a portable C library and CLI for native LLM inference using GGUF models and GGML. It loads transformer model weights, tokenizes prompt text with the tokenizer described in the GGUF metadata, builds the inference graph, applies optional LoRA adapters, generates tokens with sampling controls, and streams generated text through a small C callback API.
The library is designed as a standalone inference primitive for C/C++ applications. A kc_llm_t context owns the loaded model, backend resources, LoRA adapters, error state, and optional KV-cache used to accelerate generation.
CLI
Run native GGUF text generation from standard input. The CLI opens a model once and runs a resident request loop: each request is read from stdin until the --until delimiter byte (default EOT, byte 4), inference runs, the response is streamed to stdout, and the same delimiter is written at the end of each response. When stdin closes, the process exits cleanly.
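The framing described above can be sketched from the point of view of a program that reads or writes the same byte stream. This is a minimal illustration, not code from llm.c: read_request and write_response are hypothetical helper names, and the silent truncation of oversized requests is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of llm.c: reads one request from `in` into
 * `buf` until the delimiter byte or end of stream. Bytes that do not fit
 * are dropped. Returns the stored length, or -1 when the stream ended
 * before any byte was read. */
static long read_request(FILE *in, int delim, char *buf, size_t cap) {
    size_t len = 0;
    int saw_any = 0;
    int c;

    assert(in != NULL && buf != NULL && cap > 0);
    while ((c = fgetc(in)) != EOF) {
        saw_any = 1;
        if (c == delim) {
            break;                    /* end of this request */
        }
        if (len + 1 < cap) {
            buf[len++] = (char)c;     /* keep room for the terminator */
        }
    }
    buf[len] = '\0';
    return saw_any ? (long)len : -1;  /* -1: input closed, caller can exit */
}

/* Hypothetical helper: writes a response followed by the same delimiter. */
static int write_response(FILE *out, int delim, const char *text) {
    assert(out != NULL && text != NULL);
    if (fputs(text, out) == EOF || fputc(delim, out) == EOF) {
        return -1;
    }
    return fflush(out) == 0 ? 0 : -1;
}
```

A client talking to the resident process would call write_response on the pipe connected to the CLI's stdin and read_request on the pipe connected to its stdout, using the same delimiter value as --until.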
Environment Variables
All CLI parameters can be set via environment variables. CLI flags override env values, which override built-in defaults.
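The precedence rule (flag, then environment, then built-in default) can be sketched as below. resolve_float is a hypothetical helper, not a function exported by llm.c, and skipping a malformed value instead of reporting an error is an assumption of this sketch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: resolves one float parameter.
 * `flag_value` is the text given on the command line, or NULL when the
 * flag was not passed. `env_name` is the matching KC_LLM_* variable. */
static double resolve_float(const char *flag_value, const char *env_name,
                            double fallback) {
    const char *sources[2];
    int i;

    assert(env_name != NULL);
    sources[0] = flag_value;          /* highest priority: CLI flag */
    sources[1] = getenv(env_name);    /* next: environment variable */

    for (i = 0; i < 2; i++) {
        char *end = NULL;
        double value;

        if (sources[i] == NULL || sources[i][0] == '\0') {
            continue;                 /* source not set */
        }
        value = strtod(sources[i], &end);
        if (end != sources[i] && *end == '\0') {
            return value;             /* the whole string parsed as a number */
        }
    }
    return fallback;                  /* lowest priority: built-in default */
}
```

With KC_LLM_TEMP=0.9 exported and no --temp flag, resolve_float(NULL, "KC_LLM_TEMP", 0.80) yields 0.9; passing "0.5" as the flag value yields 0.5 regardless of the environment.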
| Variable | Maps to | Type |
|---|---|---|
| KC_LLM_MODEL | MODEL (positional) | Path string |
| KC_LLM_CTX | --ctx | Integer |
| KC_LLM_PREDICT | --predict | Integer |
| KC_LLM_THREADS | --threads | Integer |
| KC_LLM_GPU | --gpu | Integer |
| KC_LLM_GPU_LAYERS | --gpu-layers | Integer |
| KC_LLM_TEMP | --temp | Float |
| KC_LLM_TOP_K | --top-k | Integer |
| KC_LLM_TOP_P | --top-p | Float |
| KC_LLM_PENALTY | --penalty | Float |
| KC_LLM_REPEAT_LAST_N | --repeat-last-n | Integer |
| KC_LLM_SEED | --seed | Integer |
| KC_LLM_LORA | --lora (multiple) | path1:scale1,path2:scale2 |
export KC_LLM_MODEL="llama-3-8b.gguf"
export KC_LLM_TEMP="0.9"
export KC_LLM_LORA="adapter1.safetensors:0.8,adapter2.safetensors:0.5"
echo "Hello" | ./bin/x86_64/linux/llm
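The KC_LLM_LORA value packs several adapters into one string. One way to split it is sketched below. The type lora_spec_t, the function parse_lora_list, and the size limits are hypothetical, and the fallback scale of 1.0 for an entry without a scale mirrors the --lora-scale default as an assumption, not confirmed behavior.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LORA_PATH_CAP 256

typedef struct {
    char path[LORA_PATH_CAP];
    float scale;
} lora_spec_t;                        /* hypothetical type for this sketch */

/* Parses "path1:scale1,path2:scale2" into `out`.
 * The scale is taken after the LAST ':' of each item, so a path that
 * itself contains ':' must carry an explicit scale.
 * Returns the number of entries, or -1 on malformed input. */
static int parse_lora_list(const char *text, lora_spec_t *out, int max) {
    int n = 0;

    assert(text != NULL && out != NULL && max > 0);
    while (*text != '\0') {
        const char *comma = strchr(text, ',');
        size_t item_len = comma ? (size_t)(comma - text) : strlen(text);
        size_t path_len = item_len;
        float scale = 1.0f;           /* assumed default when no scale given */
        size_t i;

        for (i = item_len; i > 0; i--) {
            if (text[i - 1] == ':') {
                char num[32];
                char *stop = NULL;
                size_t num_len = item_len - i;

                if (num_len == 0 || num_len >= sizeof num) return -1;
                memcpy(num, text + i, num_len);
                num[num_len] = '\0';
                scale = strtof(num, &stop);
                if (*stop != '\0') return -1;   /* scale is not a number */
                path_len = i - 1;
                break;
            }
        }
        if (path_len == 0 || path_len >= LORA_PATH_CAP || n >= max) return -1;
        memcpy(out[n].path, text, path_len);
        out[n].path[path_len] = '\0';
        out[n].scale = scale;
        n++;

        text += item_len;
        if (*text == ',') text++;     /* step over the separator */
    }
    return n;
}
```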
Examples
Single response generation:
echo "What is the capital of France?" | ./bin/x86_64/linux/llm llama-3-8b.gguf
With a LoRA adapter:
printf '%s\n' "Hello" | ./bin/x86_64/linux/llm \
base.gguf \
--lora adapter.safetensors --lora-scale 0.8
Parameters
| Flag | Description | Default |
|---|---|---|
| -h, --help | Show help and usage | - |
| -v, --version | Show version | - |
| MODEL (positional) | Path to GGUF model file (required) | - |
| --ctx N | Context size in tokens | 2048 |
| --predict N | Max tokens to predict | 128 |
| --threads N | Number of threads | auto |
| --gpu N | GPU mode: -1 auto, 0 CPU, > 0 require GPU | -1 |
| --gpu-layers N | Number of layers to offload to GPU | all |
| --temp F | Temperature for sampling | 0.80 |
| --top-k N | Top-k sampling parameter | 40 |
| --top-p F | Top-p sampling parameter | 0.95 |
| --penalty F | Repeat penalty parameter | 1.10 |
| --repeat-last-n N | Last tokens for penalty | 64 |
| --seed N | RNG seed (-1 for random) | -1 |
| --until N | Request/response delimiter byte | 4 (EOT) |
| --lora PATH | Apply a LoRA adapter (repeatable) | - |
| --lora-scale F | Scale for the previous LoRA | 1.0 |
| --kv-load PATH | Load a KV-cache snapshot before reading prompts | - |
| --kv-save PATH | Save the KV-cache snapshot after requests | - |
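For readers unfamiliar with --top-k and --top-p, the sketch below shows the textbook form of both filters applied to an already normalized probability vector. It is a generic illustration and not the sampler used by llm.c: filter_candidates is a hypothetical name, and temperature scaling and the repeat penalty, which would act on the logits before this step, are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Generic top-k / top-p filter, hypothetical and simplified.
 * `prob` holds n probabilities that sum to 1. On return, tokens outside
 * the kept set have probability 0 and the kept ones are renormalized. */
static void filter_candidates(float *prob, size_t n, size_t top_k, float top_p) {
    float kept = 0.0f;
    size_t picked = 0;
    size_t i;

    assert(prob != NULL && n > 0 && top_k > 0 && top_p > 0.0f);

    /* Keep the most likely remaining token until either limit is reached:
     * top_k tokens kept, or cumulative probability of at least top_p.
     * A kept entry is marked by flipping its sign, so no extra buffer
     * is needed. */
    while (picked < top_k && kept < top_p) {
        size_t best = n;

        for (i = 0; i < n; i++) {
            if (prob[i] > 0.0f && (best == n || prob[i] > prob[best])) {
                best = i;
            }
        }
        if (best == n) {
            break;                    /* nothing left to keep */
        }
        kept += prob[best];
        prob[best] = -prob[best];
        picked++;
    }

    /* Restore the sign of kept entries, renormalize, and zero the rest. */
    for (i = 0; i < n; i++) {
        prob[i] = (prob[i] < 0.0f) ? -prob[i] / kept : 0.0f;
    }
}
```

With probabilities {0.5, 0.3, 0.15, 0.05}, top_k = 3, and top_p = 0.7, the first two tokens are kept (cumulative 0.8) and the third is dropped even though top_k would allow it. A sampler would then draw from the renormalized survivors.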
Output
Generated text is written directly to standard output as it is produced, followed by the --until delimiter byte. Diagnostics and errors are written to standard error.
Compatibility
llm.c supports the GGUF model families implemented by its local graph builders and tokenizer backends:
- Llama-style: llama, mistral, mixtral - SiLU activation, RoPE, standard GQA
- Qwen-style: qwen2, qwen2.5, qwen3 - SiLU activation, RoPE with Qwen freq base, Qwen3 Q/K RMS normalization
- Gemma-style: gemma - GELU activation, embedding scale √n_embd
- GPT-2: gpt2 - LayerNorm, learned position embeddings, GELU, non-gated FFN
The engine requires all mandatory tensors (token embeddings, output norm, and standard transformer block weights) to be present in the GGUF file. Quantized models (Q4_0, Q4_K_M, Q8_0, etc.) are supported via GGML.
Tokenization
Prompt text is encoded by the tokenizer implementation selected from the GGUF metadata. The engine currently implements:
- BPE: GPT-2 byte-level BPE (tokenizer.ggml.model = gpt2) with gpt2, gpt-2, qwen2, and smollm pre-tokenizers.
- SentencePiece: Google SentencePiece (tokenizer.ggml.model = llama), used by LLaMA, Mistral, and Gemma models.
- Unigram: Google Unigram (tokenizer.ggml.model = unigram).
Model compatibility is therefore bounded by both the architecture metadata and the tokenizer metadata. Unsupported architectures, tokenizer models, or tokenizer pre-tokenizers fail during model load with a clear error.
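The selection rule can be pictured as a lookup on the tokenizer.ggml.model string, with anything unknown treated as a load failure. The enum and function names below are hypothetical. Only the metadata values (gpt2, llama, unigram) and the pre-tokenizer names come from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    TOK_UNSUPPORTED = 0,
    TOK_BPE,
    TOK_SENTENCEPIECE,
    TOK_UNIGRAM
} tok_kind_t;                         /* hypothetical enum for this sketch */

/* Maps the tokenizer.ggml.model value to a tokenizer implementation. */
static tok_kind_t tokenizer_kind(const char *model) {
    assert(model != NULL);
    if (strcmp(model, "gpt2") == 0) return TOK_BPE;
    if (strcmp(model, "llama") == 0) return TOK_SENTENCEPIECE;
    if (strcmp(model, "unigram") == 0) return TOK_UNIGRAM;
    return TOK_UNSUPPORTED;           /* caller fails the model load */
}

/* Returns 1 when the BPE pre-tokenizer name is one of the listed ones. */
static int bpe_pretokenizer_supported(const char *pre) {
    static const char *const known[] = { "gpt2", "gpt-2", "qwen2", "smollm" };
    size_t i;

    assert(pre != NULL);
    for (i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(pre, known[i]) == 0) return 1;
    }
    return 0;
}
```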
Public API
#include <string.h> /* strdup */
#include "llm.h"
kc_llm_options_t opts = kc_llm_options_default();
opts.model_path = strdup("model.gguf"); /* owned by opts, released by kc_llm_options_free() */
kc_llm_options_load_env(&opts); /* optional: override from env */
kc_llm_t *ctx = NULL;
if (kc_llm_open(&ctx, &opts) == 0) {
kc_llm_generate(ctx, "Hello!", write_callback, user_data); /* streams text through write_callback */
kc_llm_close(ctx);
}
kc_llm_options_free(&opts); /* release owned strings */
Persistent KV-cache snapshots let an application evaluate a bootstrap context once and reuse it for later requests:
kc_llm_prefill(ctx, "bootstrap prompt");
kc_llm_kv_save(ctx, "agent.kv");
kc_llm_kv_clear(ctx);
kc_llm_kv_load(ctx, "agent.kv");
kc_llm_generate(ctx, "continue from here", write_callback, user_data);
Memory & Ownership
llm.c uses a clear ownership model to ensure predictable memory behavior:
- Options: kc_llm_options_t is copied during kc_llm_open(). Call kc_llm_options_free() to release owned strings and the dynamic LoRA array after the copy is made.
- Callbacks: The buffer provided to the kc_llm_write_fn callback is owned by the library and is only valid for the duration of that specific callback execution.
- Errors: kc_llm_error() returns a pointer to a context-owned string. It remains valid until the next state-modifying call on that context.
- Generation: The caller owns prompt storage before and after each kc_llm_generate() call.
- KV snapshots: Snapshot files contain runtime K/V tensor prefixes and model metadata only. They do not store prompt text and are rejected when the runtime dimensions do not match. Loading a snapshot replaces the logical KV position in the existing context without reloading model weights.
Lifecycle
- kc_llm_options_default() - returns an options struct with built-in defaults.
- kc_llm_options_load_env() - overrides options from KC_LLM_* environment variables.
- kc_llm_lora_add() - appends a LoRA descriptor (path + scale) to the options array.
- kc_llm_options_free() - frees owned strings and the LoRA array within options.
- kc_llm_open() - allocates and prepares a new LLM context with specific options.
- kc_llm_lora_apply() - loads and registers a safetensors LoRA adapter.
- kc_llm_lora_clear() - releases all applied adapters.
- kc_llm_prefill() - evaluates prompt text into the KV-cache without output.
- kc_llm_kv_save() - writes the current KV-cache state to a snapshot file.
- kc_llm_kv_load() - loads a compatible KV-cache snapshot.
- kc_llm_kv_clear() - clears the logical KV-cache position.
- kc_llm_generate() - performs synchronous generation from input. Supports streaming via callback.
- kc_llm_stop() - thread-safe mechanism to stop an ongoing generation.
- kc_llm_close() - releases the context and all associated resources.
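A synchronous generate call paired with a thread-safe stop is commonly built on an atomic flag that the token loop polls. The sketch below shows that general pattern with C11 atomics. It is not the implementation behind kc_llm_stop(), and every name in it (gen_state_t, request_stop, run_generation, stop_after) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_int stop_requested;        /* set from any thread */
} gen_state_t;

typedef int (*token_fn)(int index, void *user_data);

/* Safe to call from another thread or from inside the token callback. */
static void request_stop(gen_state_t *state) {
    assert(state != NULL);
    atomic_store(&state->stop_requested, 1);
}

/* Produces up to max_tokens tokens, checking the flag before each one.
 * Returns the number of tokens handed to the callback. */
static int run_generation(gen_state_t *state, int max_tokens,
                          token_fn on_token, void *user_data) {
    int produced = 0;

    assert(state != NULL && on_token != NULL && max_tokens >= 0);
    atomic_store(&state->stop_requested, 0);   /* a new request clears old stops */
    while (produced < max_tokens && !atomic_load(&state->stop_requested)) {
        on_token(produced, user_data);
        produced++;
    }
    return produced;
}

/* Example callback: asks for a stop once `limit` tokens were seen. */
typedef struct {
    gen_state_t *state;
    int limit;
} stop_after_t;

static int stop_after(int index, void *user_data) {
    stop_after_t *cfg = (stop_after_t *)user_data;

    if (index + 1 >= cfg->limit) {
        request_stop(cfg->state);
    }
    return 0;
}
```

In this sketch the stop takes effect at the next token boundary, so the token being produced when the flag is set is still delivered.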
KV Snapshot Usage
echo "Initial request." | ./bin/x86_64/linux/llm \
model.gguf \
--kv-save checkpoint.kv
echo "Implement the requested change." | ./bin/x86_64/linux/llm \
model.gguf \
--kv-load checkpoint.kv \
--kv-save session.kv
KV snapshots are tied to the loaded model dimensions, context size, KV head count, head dimension, layer count, and F32 KV tensor type. Incompatible files fail during load instead of being applied.
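The compatibility check amounts to comparing the dimensions recorded in the snapshot with those of the running context before any tensor data is applied. The struct below is a hypothetical stand-in for whatever the snapshot header really contains. The on-disk layout is not described here, and kv_dims_t and kv_mismatch are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical set of fields. The real snapshot header may differ. */
typedef struct {
    uint32_t n_ctx;                   /* context size in tokens */
    uint32_t n_layer;                 /* transformer layer count */
    uint32_t n_head_kv;               /* KV head count */
    uint32_t head_dim;                /* dimension per head */
    uint32_t kv_type;                 /* tensor type tag (F32 in the text) */
} kv_dims_t;

/* Returns NULL when the snapshot matches the runtime, otherwise the name
 * of the first field that differs, for use in an error message. */
static const char *kv_mismatch(const kv_dims_t *file, const kv_dims_t *runtime) {
    assert(file != NULL && runtime != NULL);
    if (file->n_ctx != runtime->n_ctx) return "context size";
    if (file->n_layer != runtime->n_layer) return "layer count";
    if (file->n_head_kv != runtime->n_head_kv) return "KV head count";
    if (file->head_dim != runtime->head_dim) return "head dimension";
    if (file->kv_type != runtime->kv_type) return "KV tensor type";
    return NULL;
}
```

One practical consequence of the rule stated above is that a snapshot saved with one --ctx value is expected to be rejected when loaded with a different one.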
Build
Compiled artifacts are generated under bin/{arch}/{platform}/ for the host architecture running the build.
make clean && make
CUDA support is opt-in. Pass CUDA=1 to request a CUDA-enabled build. The flag only has an effect for supported targets and only when the build machine has a usable CUDA toolkit; otherwise the build remains CPU-only.
make CUDA=1
make CUDA=1 x86_64/linux
When GPU support is available, the engine follows these semantics:
- --gpu -1 (Auto): Uses GPU if a compatible device is found, falls back to CPU otherwise.
- --gpu 0: Disables GPU and strictly uses the CPU backend.
- --gpu >0: Explicitly requires a GPU. Fails with a descriptive error if CUDA support was not enabled at build time or no compatible device is found at runtime.
- --gpu-layers N: Controls how many transformer layers are offloaded to VRAM. Any remaining layers are kept in RAM. This allows running large models on limited hardware by combining GPU and CPU resources.
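The GPU semantics above reduce to a small decision function. The sketch below encodes them with hypothetical names (backend_t, pick_backend, layers_on_gpu). Treating every negative --gpu value as auto, and a negative layer count as "all", are assumptions of this sketch.

```c
#include <assert.h>

typedef enum {
    BACKEND_ERROR = -1,               /* GPU required but not usable */
    BACKEND_CPU = 0,
    BACKEND_GPU = 1
} backend_t;                          /* hypothetical enum */

/* gpu_mode: value of --gpu. cuda_built: 1 when compiled with CUDA=1.
 * device_found: 1 when a compatible device exists at runtime. */
static backend_t pick_backend(int gpu_mode, int cuda_built, int device_found) {
    int gpu_usable = cuda_built && device_found;

    if (gpu_mode == 0) {
        return BACKEND_CPU;                              /* forced CPU */
    }
    if (gpu_mode < 0) {
        return gpu_usable ? BACKEND_GPU : BACKEND_CPU;   /* auto with fallback */
    }
    return gpu_usable ? BACKEND_GPU : BACKEND_ERROR;     /* GPU required */
}

/* Number of layers placed in VRAM. The rest stay in RAM. */
static int layers_on_gpu(backend_t backend, int n_layer, int gpu_layers) {
    assert(n_layer >= 0);
    if (backend != BACKEND_GPU) {
        return 0;
    }
    if (gpu_layers < 0 || gpu_layers > n_layer) {
        return n_layer;               /* "all", or clamp to the model size */
    }
    return gpu_layers;
}
```

For a 32-layer model with --gpu-layers 20 on a usable GPU, this yields 20 layers in VRAM and 12 in RAM.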
CUDA Output Directory
When building with CUDA=1, artifacts are placed in a separate cuda/ subdirectory to keep CPU and GPU builds separate:
bin/x86_64/linux/llm # CPU-only build (make)
bin/x86_64/linux/cuda/llm # GPU-enabled build (make CUDA=1)
bin/x86_64/linux/libllm.so # CPU-only shared library
bin/x86_64/linux/cuda/libllm.so # GPU-enabled shared library
Each CUDA build uses an isolated CMake build directory (.build/{arch}-linux-cuda/) to avoid cache conflicts. This applies only to CUDA-capable architectures (x86_64 and aarch64). When CUDA=1 is set for an architecture that does not support CUDA, the build proceeds CPU-only with a notification, and the output goes to the standard output directory without the cuda/ subdirectory.
make # builds CPU-only → bin/x86_64/linux/llm
make CUDA=1 # builds with CUDA if supported → bin/x86_64/linux/cuda/llm
make CUDA=1 all # CUDA on capable archs, CPU with notice on others
To test the CUDA binary:
TEST_BIN=./bin/x86_64/linux/cuda/llm sh test.sh
Multiarch Builds
The project is prepared to build artifacts for multiple architectures under bin/{arch}/{platform}/. A plain make builds only the current host architecture, while the targets below build the full matrix or a specific target.
make all
make x86_64/linux
make x86_64/windows
make i686/linux
make i686/windows
make aarch64/linux
make aarch64/android
make armv7/linux
make armv7/android
make armv7hf/linux
make riscv64/linux
make powerpc64le/linux
make mips/linux
make mipsel/linux
make mips64el/linux
make s390x/linux
make loongarch64/linux
Dependencies
| Path | Description |
|---|---|
| lib/ggml/ | Tensor computation library for machine learning |
| lib/kaisarcode/gguf/ | GGUF I/O and tokenizers library |
| lib/model.gguf | Embedded model weights |
Troubleshooting
CUDA cross-compilation
See etc/docs/cuda-cross-compile.md.
Beta Notice
This is a beta project tested only on Debian x86_64. It was created out of a personal need for these libraries, but no guarantees are provided regarding its stability or future support. You are free to test it, use it, and modify it as you please.
If you'd like to reach out, you can send an email to [email protected]. Please note that I do not accept pull requests; the goal is to avoid long-term dependency on platforms like GitHub, and I do not maintain fixed infrastructure to guarantee long-term stability for these projects.
Repo
You can download the repository and read the most up-to-date documentation directly from its official source.
GitHub: kaisarcode/llm.c
License
This project is distributed under the GNU General Public License version 3 (GPLv3).
