gguf.c - GGUF utilities

gguf.c is a library and CLI for working with GGUF model files. It parses model headers and metadata, loads tokenizers (GPT-2 BPE, SentencePiece, Unigram), quantizes GGUF files, and constructs transformer compute graphs via GGML for Llama, Mistral, Qwen2/2.5/3, Gemma, and GPT-2 architectures. It includes no inference runtime, KV cache, or LoRA support, which makes it suitable for model inspection, tokenization, and graph prototyping without a full inference engine.

Supported Architectures

Architecture	Graph builder
`llama`, `mistral`, `mixtral`	`kc_gguf_build_graph_llama`
`qwen2`, `qwen2.5`, `qwen3`	`kc_gguf_build_graph_llama`
`gemma`	`kc_gguf_build_graph_gemma`
`gpt2`	`kc_gguf_build_graph_gpt2`
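
The table above can be read as a simple lookup from the architecture string to a graph builder. The following is a minimal sketch of that lookup, not the library's own dispatch code: `builder_name_for_arch` and `ARCH_TABLE` are hypothetical names, and the function returns the builder's name as a string so the example compiles without linking against gguf.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the architecture table: architecture string -> builder name. */
struct arch_entry {
    const char *arch;
    const char *builder;
};

/* Rows copied from the table above. */
static const struct arch_entry ARCH_TABLE[] = {
    { "llama",   "kc_gguf_build_graph_llama" },
    { "mistral", "kc_gguf_build_graph_llama" },
    { "mixtral", "kc_gguf_build_graph_llama" },
    { "qwen2",   "kc_gguf_build_graph_llama" },
    { "qwen2.5", "kc_gguf_build_graph_llama" },
    { "qwen3",   "kc_gguf_build_graph_llama" },
    { "gemma",   "kc_gguf_build_graph_gemma" },
    { "gpt2",    "kc_gguf_build_graph_gpt2"  },
};

/* Hypothetical helper: returns the builder name for `arch`,
 * or NULL when the architecture is not in the table. */
const char *builder_name_for_arch(const char *arch)
{
    size_t i;

    assert(arch != NULL);
    for (i = 0; i < sizeof ARCH_TABLE / sizeof ARCH_TABLE[0]; i++) {
        if (strcmp(ARCH_TABLE[i].arch, arch) == 0)
            return ARCH_TABLE[i].builder;
    }
    return NULL;
}
```

A caller would typically treat a NULL result as "unsupported architecture" and stop before attempting to build a graph.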

Quantization Types

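The --quantize / -q operation converts a model's tensors to the requested type, using F32 as an intermediate when needed. Not every type is a valid target: the tables below separate the types that can be written from those that can only be read.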

Float (no quantization)

Type	Bits/element
`F32`	32
`F16`	16
`BF16`	16

Quantization targets

These types can be used as --quantize targets:

Type	Bits/element	Category
`Q1_0`	~1	1-bit
`Q4_0`	~4	Legacy
`Q4_1`	~4.5	Legacy
`Q5_0`	~5	Legacy
`Q5_1`	~5.5	Legacy
`Q8_0`	~8	Legacy
`Q2_K`	~2	K-quant
`Q3_K`	~3	K-quant
`Q4_K`	~4	K-quant
`Q5_K`	~5	K-quant
`Q6_K`	~6	K-quant
`IQ2_S`	~2	I-quant
`IQ3_XXS`	~3	I-quant
`IQ3_S`	~3	I-quant
`IQ4_NL`	~4	I-quant
`IQ4_XS`	~4	I-quant
`TQ1_0`	~1	Ternary
`TQ2_0`	~2	Ternary
`MXFP4`	4	Microscaling FP4
`NVFP4`	4	NVIDIA FP4
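
The bits/element figures above are nominal. Block-quantized types also store per-block scale data, so the effective storage cost is somewhat higher. The sketch below computes the effective figure for the legacy types, assuming the block layouts used by upstream ggml (32 elements per block, with the byte counts listed in the table). `LEGACY_BLOCKS` and `bits_per_element` are hypothetical names, and the layouts bundled in `lib/ggml/` are assumed to match upstream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Storage layout of one quantization block. */
struct block_layout {
    const char *type;
    unsigned bytes_per_block;
    unsigned elems_per_block;
};

/* Assumed to match upstream ggml block definitions. */
static const struct block_layout LEGACY_BLOCKS[] = {
    { "Q4_0", 18, 32 }, /* fp16 scale + 16 bytes of 4-bit values        */
    { "Q4_1", 20, 32 }, /* fp16 scale + fp16 min + 16 bytes             */
    { "Q5_0", 22, 32 }, /* fp16 scale + 4 bytes high bits + 16 bytes    */
    { "Q5_1", 24, 32 }, /* fp16 scale + fp16 min + 4 bytes + 16 bytes   */
    { "Q8_0", 34, 32 }, /* fp16 scale + 32 signed 8-bit values          */
};

/* Hypothetical helper: effective bits per element including block
 * overhead, or -1.0 when the type is not in the table. */
double bits_per_element(const char *type)
{
    size_t i;

    assert(type != NULL);
    for (i = 0; i < sizeof LEGACY_BLOCKS / sizeof LEGACY_BLOCKS[0]; i++) {
        const struct block_layout *b = &LEGACY_BLOCKS[i];
        if (strcmp(b->type, type) == 0)
            return (double)b->bytes_per_block * 8.0 / b->elems_per_block;
    }
    return -1.0;
}
```

Multiplying the result by a tensor's element count and dividing by 8 gives an estimate of its size in bytes after quantization.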

Dequantize-only types

These types exist in GGUF files and can be dequantized to F32/F16, but cannot be used as quantization targets:

Type	Bits/element	Category
`Q8_1`	~8.5	Legacy
`IQ1_S`	~1	I-quant
`IQ1_M`	~1.5	I-quant
`IQ2_XXS`	~2	I-quant
`IQ2_XS`	~2	I-quant
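
The three tables above imply a check that a tool could run before quantizing: is the requested type a float type, a quantization target, or readable only? The following is an illustrative sketch of that check, with the type names copied from the tables. `classify_type` and the `type_role` enum are hypothetical and are not part of the gguf.c API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum type_role { TYPE_UNKNOWN, TYPE_FLOAT, TYPE_TARGET, TYPE_DEQUANT_ONLY };

#define COUNT_OF(a) (sizeof (a) / sizeof (a)[0])

static const char *const FLOAT_TYPES[] = { "F32", "F16", "BF16" };

static const char *const TARGET_TYPES[] = {
    "Q1_0", "Q4_0", "Q4_1", "Q5_0", "Q5_1", "Q8_0",
    "Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K",
    "IQ2_S", "IQ3_XXS", "IQ3_S", "IQ4_NL", "IQ4_XS",
    "TQ1_0", "TQ2_0", "MXFP4", "NVFP4",
};

static const char *const DEQUANT_ONLY_TYPES[] = {
    "Q8_1", "IQ1_S", "IQ1_M", "IQ2_XXS", "IQ2_XS",
};

/* Returns 1 when `name` appears in `list`, 0 otherwise. */
static int in_list(const char *name, const char *const *list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(list[i], name) == 0)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: classifies a type name (case-sensitive)
 * according to the tables in this section. */
enum type_role classify_type(const char *name)
{
    assert(name != NULL);
    if (in_list(name, FLOAT_TYPES, COUNT_OF(FLOAT_TYPES)))
        return TYPE_FLOAT;
    if (in_list(name, TARGET_TYPES, COUNT_OF(TARGET_TYPES)))
        return TYPE_TARGET;
    if (in_list(name, DEQUANT_ONLY_TYPES, COUNT_OF(DEQUANT_ONLY_TYPES)))
        return TYPE_DEQUANT_ONLY;
    return TYPE_UNKNOWN;
}
```

With such a check, a request like `-q IQ1_S` could be rejected up front with a clear message instead of failing partway through a conversion.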

CLI

The CLI inspects model metadata, tokenizes text, decodes token IDs read from standard input, and quantizes GGUF files.

Examples

Print model metadata:

./bin/x86_64/linux/gguf model.gguf --info

Tokenize text:

./bin/x86_64/linux/gguf model.gguf --tokenize "Hello world"

Decode token IDs (one per line, from stdin):

./bin/x86_64/linux/gguf model.gguf --detokenize < ids.txt
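
The input format for --detokenize is one token ID per line. The sketch below shows one way such input could be parsed. It assumes IDs are non-negative decimal integers, that blank lines are ignored, and that lines are shorter than the read buffer. None of this is confirmed by the CLI's own source, and `parse_token_id` and `read_token_ids` are hypothetical helpers.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Advances past spaces, tabs, and line terminators. */
static const char *skip_space(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return p;
}

/* Parses one line. Returns 1 and stores the ID on success,
 * 0 for a blank line, -1 for malformed or out-of-range input. */
int parse_token_id(const char *line, int *id)
{
    char *end;
    long v;

    assert(line != NULL && id != NULL);
    errno = 0;
    v = strtol(line, &end, 10);
    if (end == line) /* no digits were consumed */
        return *skip_space(line) == '\0' ? 0 : -1;
    if (*skip_space(end) != '\0' || errno == ERANGE || v < 0 || v > INT_MAX)
        return -1;
    *id = (int)v;
    return 1;
}

/* Reads IDs from `in` until EOF. Returns the number of IDs stored,
 * or -1 on a malformed line or when `max_ids` would be exceeded. */
int read_token_ids(FILE *in, int *ids, int max_ids)
{
    char line[64];
    int n = 0;

    assert(in != NULL && ids != NULL);
    while (fgets(line, sizeof line, in) != NULL) {
        int id;
        int rc = parse_token_id(line, &id);

        if (rc < 0 || (rc == 1 && n == max_ids))
            return -1;
        if (rc == 1)
            ids[n++] = id;
    }
    return n;
}
```

Calling `read_token_ids(stdin, ids, max_ids)` would collect the contents of a file such as ids.txt when it is redirected to standard input.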

Quantize a GGUF file:

./bin/x86_64/linux/gguf model.gguf -q Q4_K -o model-q4.gguf

Parameters

Flag	Description
`MODEL` (positional)	GGUF model path (required)
`-i`, `--info`	Print model metadata
`-t`, `--tokenize TEXT`	Tokenize TEXT and print its token IDs
`-d`, `--detokenize`	Decode token IDs read from stdin (one per line) into text
`-q`, `--quantize TYPE`	Quantize the model's tensors to TYPE
`-o`, `--output PATH`	Output path for quantized file (required with --quantize)
`-h`, `--help`	Show help and usage
`-v`, `--version`	Show version

Public API

#include "gguf.h"

Model loading

int kc_gguf_model_load(kc_gguf_model_t **out, const char *path);
void kc_gguf_model_free(kc_gguf_model_t *m);
const char *kc_gguf_error(const kc_gguf_model_t *m);

kc_gguf_model_t is a flat, publicly exposed struct that holds model dimensions and weight tensor handles. It carries no inference state, KV cache, or LoRA data.

Metadata helpers

uint32_t kc_gguf_get_arch_u32(ctx, arch, field, def);
float    kc_gguf_get_arch_f32(ctx, arch, field, def);
uint32_t kc_gguf_get_kv_u32(ctx, key, def);
float    kc_gguf_get_kv_f32(ctx, key, def);
int      kc_gguf_get_arch(ctx, arch, arch_size);
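
GGUF metadata keys for architecture-specific values are conventionally prefixed with the architecture name, as in `llama.embedding_length` or `llama.block_count`. The `arch` and `field` parameters above suggest that the `kc_gguf_get_arch_*` helpers compose keys the same way and return `def` when the key is absent, but that is an assumption rather than something this README states. The sketch below shows only the key composition; `make_arch_key` is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<arch>.<field>" into `buf`.
 * Returns 0 on success, -1 if the key does not fit in `buf_size` bytes. */
int make_arch_key(char *buf, size_t buf_size, const char *arch, const char *field)
{
    int n;

    assert(buf != NULL && arch != NULL && field != NULL);
    n = snprintf(buf, buf_size, "%s.%s", arch, field);
    if (n < 0 || (size_t)n >= buf_size)
        return -1; /* encoding error or truncated key */
    return 0;
}
```

Under that assumption, a key built this way is what one would pass to the generic `kc_gguf_get_kv_u32` or `kc_gguf_get_kv_f32` helpers.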

Graph builders

struct ggml_tensor *kc_gguf_build_graph_impl(cctx, m, n_tokens, gf, embd_out, pos_out, params);
struct ggml_tensor *kc_gguf_build_graph_llama(cctx, m, n_tokens, gf, embd_out, pos_out);
struct ggml_tensor *kc_gguf_build_graph_gemma(cctx, m, n_tokens, gf, embd_out, pos_out);
struct ggml_tensor *kc_gguf_build_graph_gpt2(cctx, m, n_tokens, gf, embd_out, pos_out);

Tokenizer

int         kc_gguf_tokenizer_load(out, gguf, n_vocab, error, error_size);
int         kc_gguf_tokenizer_encode(tok, input, tokens, max_tokens, error, error_size);
const char *kc_gguf_tokenizer_decode(tok, id);
int         kc_gguf_tokenizer_bos(tok);
int         kc_gguf_tokenizer_eos(tok);
int         kc_gguf_tokenizer_add_bos(tok);
int         kc_gguf_tokenizer_unk(tok);
int         kc_gguf_tokenizer_is_eog(tok, id);
void        kc_gguf_tokenizer_free(tok);

Build

Compiled artifacts are generated under bin/{arch}/{platform}/ for the host architecture running the build.

make clean && make

Multiarch Builds

The project can build artifacts for multiple architectures, each placed under bin/{arch}/{platform}/. A plain make builds only the current host architecture; the targets below select others.

make all
make x86_64/linux
make x86_64/windows
make i686/linux
make i686/windows
make aarch64/linux
make aarch64/android
make armv7/linux
make armv7/android
make armv7hf/linux
make riscv64/linux
make powerpc64le/linux
make mips/linux
make mipsel/linux
make mips64el/linux
make s390x/linux
make loongarch64/linux

Dependencies

Path	Description
`lib/ggml/`	Tensor computation library for machine learning

Beta Notice

This is a beta project tested only on Debian x86_64. It was created out of a personal need for these utilities, but no guarantees are provided regarding its stability or future support. You are free to test it, use it, and modify it as you please.

If you'd like to reach out, you can send an email to [email protected]. Please note that I do not accept pull requests; the goal is to avoid long-term dependency on platforms like GitHub, and I do not maintain fixed infrastructure to guarantee long-term stability for these projects.

Repo

You can download the repository and read the most up-to-date documentation directly from its official source.

GitHub: kaisarcode/gguf.c

License

This project is distributed under the GNU General Public License version 3 (GPLv3).