KaisarCode

llm.c - LLM Inference Engine

llm.c is a portable C library and CLI for native LLM inference using GGUF models and GGML. It loads transformer model weights, tokenizes prompt text from GGUF metadata, builds the inference graph, applies optional LoRA adapters, generates tokens with sampling controls, and streams generated text through a small C callback API.

The library is designed as a standalone inference primitive for C/C++ applications. A kc_llm_t context owns the loaded model, backend resources, LoRA adapters, error state, and optional KV-cache used to accelerate generation.


CLI

Run native GGUF text generation from standard input. The CLI opens a model once and then runs a resident request loop: it reads a request from stdin up to the --until delimiter byte (default EOT, byte 4), runs inference, streams the response to stdout, and writes the same delimiter at the end of each response. When stdin closes, the process exits cleanly.
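The delimiter framing described above can be sketched from the client side in plain C. This is a minimal illustration and not code from llm.c: the function names read_request and write_response are hypothetical, and dropping bytes that exceed the buffer is a simplification chosen for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read one framed message from `in` into `buf` (capacity `cap`, NUL included).
 * Reading stops at the delimiter byte or at end of input. Returns the number
 * of bytes stored, or -1 when the stream ended before any byte was read.
 * Bytes that do not fit are consumed and dropped (a simplification).
 */
long read_request(FILE *in, int delim, char *buf, size_t cap)
{
    assert(in != NULL && buf != NULL && cap > 0);
    size_t len = 0;
    int got_any = 0;
    int c;

    while ((c = fgetc(in)) != EOF) {
        got_any = 1;
        if (c == delim)
            break;
        if (len + 1 < cap)
            buf[len++] = (char)c;
    }
    buf[len] = '\0';
    return got_any ? (long)len : -1;
}

/* Write one message followed by the same delimiter byte, then flush. */
int write_response(FILE *out, int delim, const char *text, size_t len)
{
    assert(out != NULL && text != NULL);
    if (fwrite(text, 1, len, out) != len)
        return -1;
    if (fputc(delim, out) == EOF)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

A client talking to the resident process over a pipe would call write_response() with delimiter 4 to send a prompt and read_request() with the same delimiter to collect the reply.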

Environment Variables

The CLI parameters listed in the table below can be set via environment variables. CLI flags override environment values, which in turn override built-in defaults.

| Variable | Maps to | Type |
| --- | --- | --- |
| KC_LLM_MODEL | MODEL (positional) | Path string |
| KC_LLM_CTX | --ctx | Integer |
| KC_LLM_PREDICT | --predict | Integer |
| KC_LLM_THREADS | --threads | Integer |
| KC_LLM_GPU | --gpu | Integer |
| KC_LLM_GPU_LAYERS | --gpu-layers | Integer |
| KC_LLM_TEMP | --temp | Float |
| KC_LLM_TOP_K | --top-k | Integer |
| KC_LLM_TOP_P | --top-p | Float |
| KC_LLM_PENALTY | --penalty | Float |
| KC_LLM_REPEAT_LAST_N | --repeat-last-n | Integer |
| KC_LLM_SEED | --seed | Integer |
| KC_LLM_LORA | --lora (multiple) | path1:scale1,path2:scale2 |

export KC_LLM_MODEL="llama-3-8b.gguf"
export KC_LLM_TEMP="0.9"
export KC_LLM_LORA="adapter1.safetensors:0.8,adapter2.safetensors:0.5"
echo "Hello" | ./bin/x86_64/linux/llm
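The precedence rule (flag, then environment, then default) can be illustrated with a small resolver for one integer setting. This is a sketch under assumptions: resolve_int is a hypothetical helper, not part of the llm.c API, and letting unparsable text fall through to the next source is a choice made here, not documented behavior.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Resolve one integer setting in precedence order:
 *   1. the CLI flag value (NULL when the flag was not given),
 *   2. the environment variable named `env_name`,
 *   3. the built-in default `fallback`.
 * Empty or non-numeric text is skipped (an assumption of this sketch).
 */
long resolve_int(const char *flag_value, const char *env_name, long fallback)
{
    assert(env_name != NULL);
    const char *sources[2];
    sources[0] = flag_value;
    sources[1] = getenv(env_name);

    for (int i = 0; i < 2; i++) {
        if (sources[i] == NULL || *sources[i] == '\0')
            continue;
        char *end = NULL;
        long value = strtol(sources[i], &end, 10);
        if (*end == '\0')
            return value;
    }
    return fallback;
}
```

For instance, resolve_int(NULL, "KC_LLM_CTX", 2048) yields the value of KC_LLM_CTX when it is set to a number and 2048 otherwise.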

Examples

Single response generation:

echo "What is the capital of France?" | ./bin/x86_64/linux/llm llama-3-8b.gguf

With a LoRA adapter:

printf '%s\n' "Hello" | ./bin/x86_64/linux/llm \
  base.gguf \
  --lora adapter.safetensors --lora-scale 0.8
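The KC_LLM_LORA variable packs the same path and scale pairs into one string of the form path1:scale1,path2:scale2. The sketch below shows one way such a string could be split. It is illustrative only: lora_spec_t and parse_lora_list are hypothetical names, splitting on the last colon of each item is an assumption, and the fallback scale of 1.0 mirrors the --lora-scale default rather than documented parsing behavior.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char path[256];
    float scale;
} lora_spec_t;

/*
 * Parse "path1:scale1,path2:scale2" into `out` (at most `max` entries).
 * An item without ":scale" gets scale 1.0. Empty items are skipped.
 * Returns the number of entries, or -1 on a malformed or oversized item.
 */
int parse_lora_list(const char *text, lora_spec_t *out, int max)
{
    assert(text != NULL && out != NULL);
    int n = 0;

    while (*text != '\0') {
        size_t item_len = strcspn(text, ",");
        if (item_len > 0) {
            size_t path_len = item_len;
            float scale = 1.0f;
            if (n >= max)
                return -1;
            /* Look for the last ':' inside this item. */
            for (size_t i = item_len; i > 0; i--) {
                if (text[i - 1] != ':')
                    continue;
                char num[32];
                char *end = NULL;
                size_t num_len = item_len - i;
                if (num_len == 0 || num_len >= sizeof num)
                    return -1;
                memcpy(num, text + i, num_len);
                num[num_len] = '\0';
                scale = strtof(num, &end);
                if (*end != '\0')
                    return -1;
                path_len = i - 1;
                break;
            }
            if (path_len == 0 || path_len >= sizeof out[n].path)
                return -1;
            memcpy(out[n].path, text, path_len);
            out[n].path[path_len] = '\0';
            out[n].scale = scale;
            n++;
        }
        text += item_len;
        if (*text == ',')
            text++;
    }
    return n;
}
```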

Parameters

| Flag | Description | Default |
| --- | --- | --- |
| -h, --help | Show help and usage | - |
| -v, --version | Show version | - |
| MODEL (positional) | Path to GGUF model file (required) | - |
| --ctx N | Context size in tokens | 2048 |
| --predict N | Max tokens to predict | 128 |
| --threads N | Number of threads | auto |
| --gpu N | GPU mode: -1 auto, 0 CPU, > 0 require GPU | -1 |
| --gpu-layers N | Number of layers to offload to GPU | all |
| --temp F | Temperature for sampling | 0.80 |
| --top-k N | Top-k sampling parameter | 40 |
| --top-p F | Top-p sampling parameter | 0.95 |
| --penalty F | Repeat penalty parameter | 1.10 |
| --repeat-last-n N | Last tokens for penalty | 64 |
| --seed N | RNG seed (-1 for random) | -1 |
| --until N | Request/response delimiter byte | 4 (EOT) |
| --lora PATH | Apply a LoRA adapter (repeatable) | - |
| --lora-scale F | Scale for the previous LoRA | 1.0 |
| --kv-load PATH | Load a KV-cache snapshot before reading prompts | - |
| --kv-save PATH | Save the KV-cache snapshot after requests | - |

Output

Generated text is written directly to standard output as it is produced, followed by the --until delimiter byte. Diagnostics and errors are written to standard error.

Compatibility

llm.c supports the GGUF model families implemented by its local graph builders and tokenizer backends:

  • Llama-style: llama, mistral, mixtral - SiLU activation, RoPE, standard GQA
  • Qwen-style: qwen2, qwen2.5, qwen3 - SiLU activation, RoPE with Qwen freq base, Qwen3 Q/K RMS normalization
  • Gemma-style: gemma - GELU activation, embedding scale √n_embd
  • GPT-2: gpt2 - LayerNorm, learned position embeddings, GELU, non-gated FFN

The engine requires all mandatory tensors (token embeddings, output norm, and standard transformer block weights) to be present in the GGUF file. Quantized models (Q4_0, Q4_K_M, Q8_0, etc.) are supported via GGML.

Tokenization

Prompt text is encoded by the tokenizer implementation selected from the GGUF metadata. The engine currently implements:

  • BPE: GPT-2 byte-level BPE (tokenizer.ggml.model = gpt2) with gpt2, gpt-2, qwen2, and smollm pre-tokenizers.
  • SentencePiece: Google SentencePiece (tokenizer.ggml.model = llama), used by LLaMA, Mistral, and Gemma models.
  • Unigram: Google Unigram (tokenizer.ggml.model = unigram).

Model compatibility is therefore bounded by both the architecture metadata and the tokenizer metadata. Unsupported architectures, tokenizer models, or tokenizer pre-tokenizers fail during model load with a clear error.
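The selection logic described in this section amounts to a lookup on two metadata strings. The sketch below shows that dispatch in isolation. The enum and function names are hypothetical and the real loader may be organized differently; only the string values come from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    TOK_UNSUPPORTED = 0,
    TOK_BPE,
    TOK_SENTENCEPIECE,
    TOK_UNIGRAM
} tok_kind_t;

/* Map the value of tokenizer.ggml.model to a tokenizer family. */
tok_kind_t tokenizer_kind(const char *model)
{
    assert(model != NULL);
    if (strcmp(model, "gpt2") == 0)
        return TOK_BPE;
    if (strcmp(model, "llama") == 0)
        return TOK_SENTENCEPIECE;
    if (strcmp(model, "unigram") == 0)
        return TOK_UNIGRAM;
    return TOK_UNSUPPORTED;
}

/* BPE models additionally need a supported pre-tokenizer name. */
int bpe_pre_supported(const char *pre)
{
    static const char *const known[] = { "gpt2", "gpt-2", "qwen2", "smollm" };
    assert(pre != NULL);
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(pre, known[i]) == 0)
            return 1;
    }
    return 0;
}
```

A loader built this way would report an error as soon as tokenizer_kind() returns TOK_UNSUPPORTED, or when a BPE model names a pre-tokenizer that bpe_pre_supported() rejects.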


Public API

#include <string.h>                     /* strdup */
#include "llm.h"

/* write_callback and user_data are supplied by the caller. */
kc_llm_options_t opts = kc_llm_options_default();
opts.model_path = strdup("model.gguf");
kc_llm_options_load_env(&opts);         /* optional: override from env */

kc_llm_t *ctx = NULL;
if (kc_llm_open(&ctx, &opts) == 0) {
    kc_llm_generate(ctx, "Hello!", write_callback, user_data);
    kc_llm_close(ctx);
}
kc_llm_options_free(&opts);             /* release owned strings */

Persistent KV-cache snapshots can precompute bootstrap context once and reuse it for later requests:

kc_llm_prefill(ctx, "bootstrap prompt");
kc_llm_kv_save(ctx, "agent.kv");
kc_llm_kv_clear(ctx);
kc_llm_kv_load(ctx, "agent.kv");
kc_llm_generate(ctx, "continue from here", write_callback, user_data);

Memory & Ownership

llm.c uses a clear ownership model to ensure predictable memory behavior:

  • Options: kc_llm_options_t is copied during kc_llm_open(). Call kc_llm_options_free() to release owned strings and the dynamic LoRA array after the copy is made.

  • Callbacks: The buffer provided to the kc_llm_write_fn callback is owned by the library and is only valid for the duration of that specific callback execution.
  • Errors: kc_llm_error() returns a pointer to a context-owned string. It remains valid until the next state-modifying call on that context.
  • Generation: The caller owns prompt storage before and after each kc_llm_generate() call.
  • KV snapshots: Snapshot files contain runtime K/V tensor prefixes and model metadata only. They do not store prompt text and are rejected when the runtime dimensions do not match. Loading a snapshot replaces the logical KV position in the existing context without reloading model weights.
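Because the callback buffer is only valid while the callback runs, a caller that wants to keep the text has to copy it. The sketch below shows one way to do that with a growable buffer. The callback shape used here (chunk pointer, length, user pointer, integer return) is an assumption for illustration, so check the kc_llm_write_fn typedef in llm.h for the real signature. text_buf_t and collect_chunk are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *data;   /* NUL-terminated accumulated text, owned by the caller */
    size_t len;   /* bytes stored, excluding the terminator */
    size_t cap;   /* allocated capacity */
} text_buf_t;

/*
 * Hypothetical write callback: copies `len` bytes from the library-owned
 * `chunk` into caller-owned storage. Returns 0 on success, -1 on
 * allocation failure.
 */
int collect_chunk(const char *chunk, size_t len, void *user)
{
    text_buf_t *buf = user;
    assert(buf != NULL);

    if (buf->len + len + 1 > buf->cap) {
        size_t cap = buf->cap ? buf->cap : 64;
        while (cap < buf->len + len + 1)
            cap *= 2;
        char *grown = realloc(buf->data, cap);
        if (grown == NULL)
            return -1;
        buf->data = grown;
        buf->cap = cap;
    }
    if (len > 0)
        memcpy(buf->data + buf->len, chunk, len);
    buf->len += len;
    buf->data[buf->len] = '\0';
    return 0;
}
```

The caller initializes a text_buf_t to zero, passes its address as the user pointer, and frees data once generation has finished.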

Lifecycle

  • kc_llm_options_default() - returns an options struct with built-in defaults.
  • kc_llm_options_load_env() - overrides options from KC_LLM_* environment variables.
  • kc_llm_lora_add() - appends a LoRA descriptor (path + scale) to the options array.
  • kc_llm_options_free() - frees owned strings and the LoRA array within options.
  • kc_llm_open() - allocates and prepares a new LLM context with specific options.
  • kc_llm_lora_apply() - loads and registers a safetensors LoRA adapter.
  • kc_llm_lora_clear() - releases all applied adapters.
  • kc_llm_prefill() - evaluates prompt text into the KV-cache without output.
  • kc_llm_kv_save() - writes the current KV-cache state to a snapshot file.
  • kc_llm_kv_load() - loads a compatible KV-cache snapshot.
  • kc_llm_kv_clear() - clears the logical KV-cache position.
  • kc_llm_generate() - performs synchronous generation from input. Supports streaming via callback.
  • kc_llm_stop() - thread-safe mechanism to stop an ongoing generation.
  • kc_llm_close() - releases the context and all associated resources.

KV Snapshot Usage

echo "Initial request." | ./bin/x86_64/linux/llm \
  model.gguf \
  --kv-save checkpoint.kv

echo "Implement the requested change." | ./bin/x86_64/linux/llm \
  model.gguf \
  --kv-load checkpoint.kv \
  --kv-save session.kv

KV snapshots are tied to the loaded model dimensions, context size, KV head count, head dimension, layer count, and F32 KV tensor type. Incompatible files fail during load instead of being applied.


Build

Compiled artifacts are generated under bin/{arch}/{platform}/ for the host architecture running the build.

make clean && make

CUDA support is opt-in. Pass CUDA=1 to request a CUDA-enabled build. The flag only has an effect for supported targets and only when the build machine has a usable CUDA toolkit; otherwise the build remains CPU-only.

make CUDA=1
make CUDA=1 x86_64/linux

When GPU support is available, the engine follows these semantics:

  • --gpu -1 (Auto): Uses GPU if a compatible device is found, falls back to CPU otherwise.
  • --gpu 0: Disables GPU and strictly uses the CPU backend.
  • --gpu >0: Explicitly requires a GPU. Fails with a descriptive error if CUDA support was not enabled at build time or no compatible device is found at runtime.
  • --gpu-layers N: Controls how many transformer layers are offloaded to VRAM. Any remaining layers are kept in RAM. This allows running large models on limited hardware by combining GPU and CPU resources.
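The four rules above reduce to a small decision function. The sketch below restates them in C for clarity. The names are hypothetical, and two details are assumptions of this sketch: any negative --gpu value is treated as auto, and a negative layer request stands for the default "all".

```c
#include <assert.h>

typedef enum {
    BACKEND_ERROR = -1,   /* GPU required but unavailable */
    BACKEND_CPU = 0,
    BACKEND_GPU = 1
} backend_t;

/* Choose a backend from the --gpu mode and what is actually available. */
backend_t pick_backend(int gpu_mode, int built_with_cuda, int device_found)
{
    int usable = built_with_cuda && device_found;

    if (gpu_mode == 0)
        return BACKEND_CPU;                            /* CPU only */
    if (gpu_mode < 0)
        return usable ? BACKEND_GPU : BACKEND_CPU;     /* auto with fallback */
    return usable ? BACKEND_GPU : BACKEND_ERROR;       /* GPU required */
}

/* Number of layers placed on the GPU; the remaining layers stay in RAM. */
int gpu_layer_count(backend_t backend, int n_layer, int requested)
{
    assert(n_layer >= 0);
    if (backend != BACKEND_GPU)
        return 0;
    if (requested < 0 || requested > n_layer)
        return n_layer;
    return requested;
}
```

For a 32-layer model with --gpu-layers 20, this split would place 20 layers in VRAM and keep 12 on the CPU.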

CUDA Output Directory

When building with CUDA=1, artifacts are placed in a separate cuda/ subdirectory to keep CPU and GPU builds separate:

bin/x86_64/linux/llm               # CPU-only build (make)
bin/x86_64/linux/cuda/llm          # GPU-enabled build (make CUDA=1)
bin/x86_64/linux/libllm.so         # CPU-only shared library
bin/x86_64/linux/cuda/libllm.so    # GPU-enabled shared library

Each CUDA target uses an isolated CMake build directory (.build/{arch}-linux-cuda/) to avoid cache conflicts. This applies only to CUDA-capable architectures (x86_64 and aarch64). When CUDA=1 is set for an architecture that does not support CUDA, the build proceeds CPU-only with a notification, and the output goes to the standard output directory without the cuda/ subdirectory.

make           # builds CPU-only → bin/x86_64/linux/llm
make CUDA=1    # builds with CUDA if supported → bin/x86_64/linux/cuda/llm
make CUDA=1 all  # CUDA on capable archs, CPU with notice on others
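A program that needs to locate the right artifact at run time can mirror the directory rule described above. The helper below is a sketch, not part of the build system: artifact_dir and arch_supports_cuda are hypothetical names, and the platform is not checked even though the CUDA build directory name suggests Linux targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* CUDA-capable architectures as listed in this README. */
int arch_supports_cuda(const char *arch)
{
    return strcmp(arch, "x86_64") == 0 || strcmp(arch, "aarch64") == 0;
}

/*
 * Write the artifact directory for a target into `out`.
 * `cuda_enabled` means CUDA=1 was requested and a toolkit was usable.
 * Returns 0 on success, -1 when `out` is too small.
 */
int artifact_dir(char *out, size_t cap, const char *arch,
                 const char *platform, int cuda_enabled)
{
    assert(out != NULL && arch != NULL && platform != NULL);
    int use_cuda = cuda_enabled && arch_supports_cuda(arch);
    int n = snprintf(out, cap, "bin/%s/%s%s", arch, platform,
                     use_cuda ? "/cuda" : "");
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```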

To test the CUDA binary:

TEST_BIN=./bin/x86_64/linux/cuda/llm sh test.sh

Multiarch Builds

The project is prepared to build artifacts for multiple architectures under bin/{arch}/{platform}/. A plain make builds only the current host architecture, while the targets below build the full matrix or a specific target.

make all
make x86_64/linux
make x86_64/windows
make i686/linux
make i686/windows
make aarch64/linux
make aarch64/android
make armv7/linux
make armv7/android
make armv7hf/linux
make riscv64/linux
make powerpc64le/linux
make mips/linux
make mipsel/linux
make mips64el/linux
make s390x/linux
make loongarch64/linux

Dependencies

| Path | Description |
| --- | --- |
| lib/ggml/ | Tensor computation library for machine learning |
| lib/kaisarcode/gguf/ | GGUF I/O and tokenizers library |
| lib/model.gguf | Embedded model weights |

Troubleshooting

CUDA cross-compilation

See etc/docs/cuda-cross-compile.md.


Beta Notice

This is a beta project tested only on Debian x86_64. It was created out of a personal need for these libraries, but no guarantees are provided regarding its stability or future support. You are free to test it, use it, and modify it as you please.

If you'd like to reach out, you can send an email to [email protected]. Please note that I do not accept pull requests; the goal is to avoid long-term dependency on platforms like GitHub, and I do not maintain fixed infrastructure to guarantee long-term stability for these projects.


Repo

You can download the repository and read the most up-to-date documentation directly from its official source.

GitHub: kaisarcode/llm.c

License

GPLv3

This project is distributed under the GNU General Public License version 3 (GPLv3).