llm.c

llm.c - LLM Inference Engine

llm.c is a portable C library and CLI for native LLM inference using GGUF models and GGML. It loads transformer model weights, tokenizes prompt text with the tokenizer described in the GGUF metadata, builds the inference graph, applies optional LoRA adapters, generates tokens with sampling controls, and streams generated text through a small C callback API.

The library is designed as a standalone inference primitive for C/C++ applications. A kc_llm_t context owns the loaded model, backend resources, LoRA adapters, error state, and the optional KV-cache used to accelerate generation.

CLI

Run native GGUF text generation from standard input. The CLI opens a model once and runs a resident request loop: each request is read from stdin until the --until delimiter byte (default EOT, byte 4), inference runs, the response is streamed to stdout, and the same delimiter is written at the end of each response. When stdin closes, the process exits cleanly.

Examples

Single response generation:

echo "What is the capital of France?" | ./bin/x86_64/linux/llm --model llama-3-8b.gguf

With a LoRA adapter:

printf '%s\n' "Hello" | ./bin/x86_64/linux/llm \
  --model base.gguf \
  --lora adapter.safetensors --lora-scale 0.8

Parameters

Flag	Description	Default
`-h`, `--help`	Show help and usage	-
`-v`, `--version`	Show version	-
`--model PATH`	Path to GGUF model file (required)	-
`--ctx N`	Context size in tokens	2048
`--predict N`	Max tokens to predict	128
`--threads N`	Number of threads	auto
`--gpu N`	GPU mode: -1 auto, 0 CPU, > 0 require GPU	-1
`--gpu-layers N`	Number of layers to offload to GPU	all
`--kv-cache N`	Enable (1) or disable (0) KV-caching	1
`--temp F`	Temperature for sampling	0.80
`--top-k N`	Top-k sampling parameter	40
`--top-p F`	Top-p sampling parameter	0.95
`--penalty F`	Repeat penalty parameter	1.10
`--repeat-last-n N`	Number of recent tokens checked for the repeat penalty	64
`--seed N`	RNG seed (-1 for random)	-1
`--until N`	Request/response delimiter byte	4 (EOT)
`--lora PATH`	Apply a LoRA adapter (repeatable)	-
`--lora-scale F`	Scale for the preceding `--lora` adapter	1.0
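The `--penalty` and `--repeat-last-n` flags interact: the penalty is applied only to tokens that appear in the most recent window of the history. The sketch below shows one common formulation of a repeat penalty (positive logits are divided by the penalty, negative logits are multiplied by it). It is an illustration under that assumption, not llm.c's actual sampler, and the function name `apply_repeat_penalty` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: penalize the logits of tokens that occur in the
 * last `last_n` entries of `history`. Each distinct token is penalized
 * once, no matter how often it repeats inside the window. */
void apply_repeat_penalty(float *logits, int n_vocab,
                          const int *history, size_t n_history,
                          size_t last_n, float penalty)
{
    assert(logits != NULL && n_vocab > 0 && penalty > 0.0f);
    assert(history != NULL || n_history == 0);

    /* First index of the penalty window. */
    size_t start = n_history > last_n ? n_history - last_n : 0;

    for (size_t i = start; i < n_history; i++) {
        int tok = history[i];
        if (tok < 0 || tok >= n_vocab) continue;

        /* Skip tokens already handled earlier in the window. */
        int seen = 0;
        for (size_t j = start; j < i; j++) {
            if (history[j] == tok) { seen = 1; break; }
        }
        if (seen) continue;

        /* Push the logit toward "less likely" regardless of its sign. */
        if (logits[tok] > 0.0f) logits[tok] /= penalty;
        else                    logits[tok] *= penalty;
    }
}
```

With a penalty of 1.0 the logits are left unchanged, which is why values slightly above 1.0 (such as the default 1.10) act as a mild discouragement rather than a hard ban.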

Output

Generated text is written directly to standard output as it is produced, followed by the --until delimiter byte. Diagnostics and errors are written to standard error.
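The delimiter framing used by the CLI can be sketched from the client side. The example below assumes the default EOT delimiter and works on any pair of stdio streams (for instance pipes connected to the process). The helper names `send_request` and `read_response` are hypothetical and not part of llm.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DELIM_EOT 4 /* default --until delimiter byte */

/* Write one request followed by the delimiter byte.
 * Returns 0 on success, -1 on a write error. */
int send_request(FILE *to_llm, const char *prompt, int delim)
{
    assert(to_llm != NULL && prompt != NULL);
    size_t len = strlen(prompt);
    if (fwrite(prompt, 1, len, to_llm) != len) return -1;
    if (fputc(delim, to_llm) == EOF) return -1;
    return fflush(to_llm) == 0 ? 0 : -1;
}

/* Read one response up to (not including) the delimiter byte.
 * Returns the number of bytes stored, or -1 if the stream ended
 * before a delimiter was seen. The output is truncated to cap - 1
 * bytes and is always NUL-terminated. */
long read_response(FILE *from_llm, char *buf, size_t cap, int delim)
{
    assert(from_llm != NULL && buf != NULL && cap > 0);
    size_t n = 0;
    int c;
    while ((c = fgetc(from_llm)) != EOF) {
        if (c == delim) {
            buf[n] = '\0';
            return (long)n;
        }
        if (n + 1 < cap) buf[n++] = (char)c;
    }
    buf[n] = '\0';
    return -1;
}
```

Because the same byte terminates both requests and responses, a client can keep one process alive and alternate `send_request` and `read_response` calls without reopening the model.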

Compatibility

llm.c supports the GGUF model families implemented by its local graph builders and tokenizer backends:

Llama-style (llama, mistral, mixtral): SiLU activation, RoPE, standard GQA
Qwen-style (qwen2, qwen2.5, qwen3): SiLU activation, RoPE with the Qwen frequency base, Qwen3 Q/K RMS normalization
Gemma-style (gemma): GELU activation, embedding scale √n_embd

The engine requires all mandatory tensors (token embeddings, output norm, and standard transformer block weights) to be present in the GGUF file. Quantized models (Q4_0, Q4_K_M, Q8_0, etc.) are supported via GGML.

Tokenization

Prompt text is encoded by the tokenizer implementation selected from the GGUF metadata. The engine currently implements:

BPE: GPT-2 byte-level BPE (tokenizer.ggml.model = gpt2) with gpt2, llama-bpe, and qwen2 pre-tokenizers.
SentencePiece: Google SentencePiece (tokenizer.ggml.model = llama), used by LLaMA, Mistral, and Gemma models.
Unigram: Google Unigram (tokenizer.ggml.model = unigram).

Model compatibility is therefore bounded by both the architecture metadata and the tokenizer metadata. Unsupported architectures, tokenizer models, or tokenizer pre-tokenizers fail during model load with a clear error.
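The selection described above amounts to a lookup on the metadata strings. The following sketch mirrors the values listed in this section; the enum and function names are hypothetical, and the real loader may structure this differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical tokenizer kinds, one per implementation listed above. */
typedef enum { TOK_BPE, TOK_SPM, TOK_UNIGRAM, TOK_UNSUPPORTED } tok_kind_t;

/* Map the tokenizer.ggml.model value to a tokenizer kind. */
tok_kind_t tokenizer_kind(const char *model)
{
    assert(model != NULL);
    if (strcmp(model, "gpt2") == 0)    return TOK_BPE;
    if (strcmp(model, "llama") == 0)   return TOK_SPM;
    if (strcmp(model, "unigram") == 0) return TOK_UNIGRAM;
    return TOK_UNSUPPORTED;
}

/* Return 1 if the BPE pre-tokenizer name is one of the supported ones. */
int bpe_pretokenizer_supported(const char *pre)
{
    static const char *const known[] = { "gpt2", "llama-bpe", "qwen2" };
    assert(pre != NULL);
    for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
        if (strcmp(pre, known[i]) == 0) return 1;
    }
    return 0;
}
```

A loader built this way would reject a model as soon as either function reports an unsupported value, which matches the fail-at-load behavior described above.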

Public API

#include "llm.h"

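/* write_callback (a kc_llm_write_fn) and user_data are supplied by the caller. */
kc_llm_options_t opts = { .model_path = "model.gguf", .ctx = 2048, .predict = 128 };
kc_llm_t *ctx = NULL;

/* kc_llm_open() returns 0 on success. */
if (kc_llm_open(&ctx, &opts) == 0) {
    /* Generated text is streamed to write_callback as it is produced. */
    kc_llm_generate(ctx, "Hello!", write_callback, user_data);
    kc_llm_close(ctx);
}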

Memory & Ownership

llm.c uses a clear ownership model to ensure predictable memory behavior:

Options: kc_llm_options_t is copied during kc_llm_open(). You can release your copy immediately.
Callbacks: The buffer provided to the kc_llm_write_fn callback is owned by the library and is only valid for the duration of that specific callback execution.
Errors: kc_llm_error() returns a pointer to a context-owned string. It remains valid until the next state-modifying call on that context.
Generation: The caller owns the prompt storage before and after each kc_llm_generate() call; the library only reads it while the call is running.
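Because the callback buffer is only valid while the callback runs, a caller that wants to keep the text has to copy it. The sketch below accumulates chunks into a growable string. The callback shape `(const char *buf, size_t len, void *user)` is an assumption made for illustration; the actual kc_llm_write_fn signature is defined in llm.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Caller-owned accumulator for streamed text. */
typedef struct {
    char  *data; /* NUL-terminated once at least one chunk arrived */
    size_t len;  /* bytes stored, excluding the terminator */
    size_t cap;  /* allocated size of data */
} text_acc_t;

/* Hypothetical callback: copy the library-owned chunk into caller-owned
 * storage before returning. Returns 0 on success, -1 if out of memory. */
int collect_chunk(const char *buf, size_t len, void *user)
{
    text_acc_t *acc = user;
    assert(acc != NULL && (buf != NULL || len == 0));

    if (acc->len + len + 1 > acc->cap) {
        size_t cap = acc->cap ? acc->cap : 64;
        while (cap < acc->len + len + 1) cap *= 2;
        char *p = realloc(acc->data, cap);
        if (p == NULL) return -1;
        acc->data = p;
        acc->cap = cap;
    }
    if (len > 0) memcpy(acc->data + acc->len, buf, len);
    acc->len += len;
    acc->data[acc->len] = '\0';
    return 0;
}
```

A zero-initialized `text_acc_t` is a valid empty accumulator, and the caller releases the collected text with `free(acc.data)` once it is no longer needed.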

Lifecycle

kc_llm_open() - allocates and prepares a new LLM context with specific options.
kc_llm_lora_apply() - loads and registers a safetensors LoRA adapter.
kc_llm_lora_clear() - releases all applied adapters.
kc_llm_generate() - performs synchronous generation from input. Supports streaming via callback.
kc_llm_stop() - thread-safe mechanism to stop an ongoing generation.
kc_llm_close() - releases the context and all associated resources.
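A thread-safe stop request is commonly implemented as an atomic flag that the generation loop polls once per token. The sketch below illustrates that pattern with C11 atomics. It is not llm.c's internal implementation, and every name in it (`gen_state_t`, `gen_stop`, `gen_run`, `stop_after_three`) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical generation state holding only the stop flag. */
typedef struct {
    atomic_int stop_requested;
} gen_state_t;

/* Safe to call from another thread or from inside a callback. */
void gen_stop(gen_state_t *s)
{
    assert(s != NULL);
    atomic_store(&s->stop_requested, 1);
}

/* Produce up to max_tokens tokens, checking the flag before each one.
 * Returns the number of tokens produced. on_token may be NULL. */
int gen_run(gen_state_t *s, int max_tokens,
            void (*on_token)(gen_state_t *, int))
{
    assert(s != NULL && max_tokens >= 0);
    atomic_store(&s->stop_requested, 0); /* each run starts unstopped */

    int produced = 0;
    while (produced < max_tokens && !atomic_load(&s->stop_requested)) {
        /* A real engine would evaluate the graph and sample here. */
        produced++;
        if (on_token != NULL) on_token(s, produced);
    }
    return produced;
}

/* Example callback: request a stop once three tokens were produced. */
void stop_after_three(gen_state_t *s, int count)
{
    if (count == 3) gen_stop(s);
}
```

The flag is cleared at the start of each run, so a stop request affects only the generation that is in progress when it arrives.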

Build

Compiled artifacts are generated under bin/{arch}/{platform}/ for the host architecture running the build.

make clean && make

CUDA support is opt-in. Pass CUDA=1 to request a CUDA-enabled build. The flag only has an effect for supported targets and only when the build machine has a usable CUDA toolkit; otherwise the build remains CPU-only.

make CUDA=1
make CUDA=1 x86_64/linux

When GPU support is available, the engine follows these semantics:

--gpu -1 (Auto): Uses GPU if a compatible device is found, falls back to CPU otherwise.
--gpu 0: Disables GPU and strictly uses the CPU backend.
--gpu >0: Explicitly requires a GPU. Fails with a descriptive error if CUDA support was not enabled at build time or no compatible device is found at runtime.
--gpu-layers N: Controls how many transformer layers are offloaded to VRAM. Any remaining layers are kept in RAM. This allows running large models on limited hardware by combining GPU and CPU resources.
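The GPU semantics above reduce to a small decision function plus a layer split. The sketch below encodes them as written. All names are hypothetical, and treating a negative `gpu_layers` value as "all layers" is an assumption about how the default is represented.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BACKEND_CPU, BACKEND_GPU, BACKEND_ERROR } backend_t;

/* gpu_mode: negative = auto, 0 = CPU only, positive = GPU required.
 * cuda_built: nonzero if CUDA support was compiled in.
 * device_found: nonzero if a compatible device exists at runtime. */
backend_t resolve_backend(int gpu_mode, int cuda_built, int device_found)
{
    int gpu_usable = cuda_built && device_found;
    if (gpu_mode == 0) return BACKEND_CPU;
    if (gpu_mode < 0)  return gpu_usable ? BACKEND_GPU : BACKEND_CPU;
    return gpu_usable ? BACKEND_GPU : BACKEND_ERROR;
}

/* Split n_layers between GPU and CPU. A negative or oversized
 * gpu_layers value offloads every layer (assumed meaning of "all"). */
void split_layers(int n_layers, int gpu_layers, int *on_gpu, int *on_cpu)
{
    assert(n_layers >= 0 && on_gpu != NULL && on_cpu != NULL);
    int g = (gpu_layers < 0 || gpu_layers > n_layers) ? n_layers : gpu_layers;
    *on_gpu = g;
    *on_cpu = n_layers - g;
}
```

For a 32-layer model, `split_layers(32, 20, &g, &c)` places 20 layers on the GPU and keeps 12 on the CPU.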

Multiarch Builds

The project can build artifacts for multiple architectures under bin/{arch}/{platform}/. A plain make builds only the current host architecture, while the targets below build the full matrix or a specific target.

make all
make x86_64/linux
make x86_64/windows
make i686/linux
make i686/windows
make aarch64/linux
make aarch64/android
make armv7/linux
make armv7/android
make armv7hf/linux
make riscv64/linux
make powerpc64le/linux
make mips/linux
make mipsel/linux
make mips64el/linux
make s390x/linux
make loongarch64/linux

License

This project is distributed under the GNU General Public License version 3 (GPLv3).

Repo

GitHub: kaisarcode/llm.c