llama.cpp Portable LLM Inference Engine in C/C++
llama.cpp is a high-performance C/C++ implementation for running LLM inference across diverse hardware. It supports GGUF model quantization, GPU acceleration on NVIDIA/AMD/Apple Silicon, and provides both a CLI and an OpenAI-compatible HTTP server for local model serving.
What it does
llama.cpp Portable LLM Inference Engine in C/C++
llama.cpp is a high-performance C/C++ implementation for running LLM inference across diverse hardware. It supports GGUF model quantization, GPU acceleration on NVIDIA/AMD/Apple Silicon, and provides both a CLI and an OpenAI-compatible HTTP server for local model serving.
Installation
Use the upstream install or setup path that matches your environment:
- Run with Docker - see our Docker documentation
Requirements and caveats from upstream:
- Python: ddh0/easy-llama
- Python: abetlen/llama-cpp-python
- Node.js: withcatai/node-llama-cpp
Basic usage or getting-started notes:
-
Install llama.cpp using brew, nix or winget
-
Download pre-built binaries from the releases page
-
Build from source by cloning this repository - check out our build guide
-
Extracted from upstream docs: https://raw.githubusercontent.com/ggml-org/llama.cpp/HEAD/README.md
Source
Capabilities
Install
Quality
deterministic score 0.45 from registry signals: · indexed on github topic:agent-skills · 8 github stars · SKILL.md body (1,307 chars)