README

Local LLM inference for PHP, in-process.
Chat, embeddings, and reasoning models — no Python sidecar, no remote API.

What is ext-infer?

ext-infer is a PHP 8.3+ extension that loads a GGUF model and runs inference in the PHP process via llama.cpp. PHP-native semantic search, RAG pipelines, and CLI/worker inference work without shelling out to Python or hitting a remote API.

It is written in Rust on top of ext-php-rs and the llama-cpp-2 bindings. The public PHP surface is fluent and role-aware: building a chat prompt looks like Prompt::system(...)->withUser(...), not a string of <|im_start|> tokens.

💬 Chat completions via an immutable Prompt builder that renders through the model's embedded template — no manual <|im_start|> plumbing.
🧱 Structured output — pass a JSON Schema (or raw GBNF grammar) and sampling is constrained so malformed output is impossible, not retried. A 0.6B model becomes a dependable extractor.
🧠 Reasoning-model aware — Response::answer() and Response::reasoning() split Qwen3 / R1-style <think>…</think> output automatically.
📊 Embeddings — Model::embed() returns an Embedding with dimensions(), normalize(), cosineSimilarity(), and packed() (zero-copy handoff to vector indexes) built in.
🎯 Reranking — RerankModel scores (query, document) pairs through Qwen3-Reranker's calibrated yes/no judgment; completes the embed → rerank two-stage retrieval pipeline.
⚡ In-process — no subprocess fork, no IPC, no daemon. Latency is whatever the model takes to decode.
🛠️ Apple Metal acceleration is opt-in (make release FEATURES=metal); CPU is the portable default.
🧵 Thread-safe — LlamaBackend is a Sync-guarded singleton and each call builds its own context, so ZTS PHP + parallel works by design.
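
To make the reasoning split concrete, here is a minimal plain-PHP sketch of separating a <think>…</think> block from the final answer. It only illustrates the idea and is not how ext-infer implements Response::answer() or Response::reasoning(); the helper name splitReasoning() and its return shape are hypothetical.

```php
<?php
// Illustrative only: splitReasoning() is a hypothetical helper, not part of ext-infer.
function splitReasoning(string $raw): array
{
    // Match the first <think>...</think> block; the /s flag lets "." span newlines.
    if (preg_match('/<think>(.*?)<\/think>/s', $raw, $m, PREG_OFFSET_CAPTURE)) {
        $start     = $m[0][1];            // byte offset of the whole block
        $length    = strlen($m[0][0]);    // byte length of the whole block
        $reasoning = trim($m[1][0]);      // text between the tags

        // Everything outside the matched block is treated as the answer.
        $answer = substr($raw, 0, $start) . substr($raw, $start + $length);

        return [
            'reasoning' => $reasoning === '' ? null : $reasoning,
            'answer'    => trim($answer),
        ];
    }

    // No reasoning block: the whole output is the answer.
    return ['reasoning' => null, 'answer' => trim($raw)];
}

$parts = splitReasoning("<think>\nSimple arithmetic.\n</think>\n\n2 + 2 equals 4.");
echo $parts['answer'], PHP_EOL;   // 2 + 2 equals 4.
```

With the extension loaded, none of this is needed in user code: the two Response methods already return the separated parts.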

Quick start

mkdir -p models
curl -L -o models/Qwen3-0.6B-Q8_0.gguf \
    https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q8_0.gguf

<?php
use Displace\Infer\Model;
use Displace\Infer\Prompt;

$model    = Model::load('models/Qwen3-0.6B-Q8_0.gguf');
$response = $model->chat(
    Prompt::system('You are a helpful assistant.')
        ->withUser('What is 2+2?'),
    maxTokens: 256,
    temperature: 0.0,
);

echo $response->answer(), PHP_EOL;   // "2 + 2 equals 4."
echo $response->reasoning() ?? '';    // captured <think>…</think>, if any

$model->close();

make build       # produces target/debug/libinfer.{so,dylib}
# Save the PHP snippet above as hello.php; on Linux, load libinfer.so instead.
php -d extension="$PWD/target/debug/libinfer.dylib" hello.php

Full walkthroughs, including the interactive Symfony Console chat and a pairwise-similarity embedding example, are under examples/.
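
The math behind pairwise similarity can be sketched in plain PHP without the extension. Everything below is an assumption for illustration: the three-dimensional vectors are made-up stand-ins for real embeddings, and normalize() and cosineSimilarity() are standalone hypothetical functions that mirror the names of the Embedding methods, not their implementation. The pack('g*', ...) call shows one way to produce little-endian float32 bytes; whether packed() uses exactly this layout is not confirmed here.

```php
<?php
// Hypothetical standalone helpers; they show the math, not ext-infer's code.

/** Scale a vector to unit L2 length (zero vectors are returned unchanged). */
function normalize(array $v): array
{
    $norm = sqrt(array_sum(array_map(fn (float $x): float => $x * $x, $v)));
    return $norm == 0.0 ? $v : array_map(fn (float $x): float => $x / $norm, $v);
}

/** Cosine similarity: the dot product of the two unit-length vectors. */
function cosineSimilarity(array $a, array $b): float
{
    $a = normalize($a);
    $b = normalize($b);
    return array_sum(array_map(fn (float $x, float $y): float => $x * $y, $a, $b));
}

// Toy vectors standing in for real model embeddings.
$vectors = [
    'cat'    => [1.0, 0.0, 0.0],
    'kitten' => [0.8, 0.6, 0.0],
    'car'    => [0.0, 0.0, 1.0],
];

// Compare every unordered pair once.
$names = array_keys($vectors);
for ($i = 0; $i < count($names); $i++) {
    for ($j = $i + 1; $j < count($names); $j++) {
        printf(
            "%s vs %s: %.3f\n",
            $names[$i],
            $names[$j],
            cosineSimilarity($vectors[$names[$i]], $vectors[$names[$j]])
        );
    }
}

// Pack a normalized vector as little-endian float32: 4 bytes per dimension.
$packed = pack('g*', ...normalize($vectors['kitten']));
echo strlen($packed), " bytes\n";   // 12 bytes
```

Normalizing once up front and storing unit vectors is the usual design choice for search, because cosine similarity then reduces to a plain dot product.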

Documentation

infer.displace.tech hosts the full guide:

Getting started — install via PIE or from source, verify, troubleshoot.
Guide — prompts, chat, raw, embeddings, choosing a model.
Recipes — multi-turn chat, semantic search, RAG over markdown, worker pools.
Reference — full API surface, exceptions, environment variables, compatibility matrix.
Advanced — threading, Metal, performance tuning.

The site is built from docs/ with mdbook and deploys automatically on every push to main.

Compatibility

PHP version	macOS arm64	Linux x86_64	Linux arm64	Windows
PHP 8.3	✅	✅	✅	—
PHP 8.4	✅	✅	✅	—
PHP 8.5	✅	✅	✅	—

ZTS is supported by design (the code is thread-safe) and enabled in composer.json, but it is not yet exercised in CI. Windows is intentionally out of scope for v0.1.

Roadmap

Shipped: chat completions · raw completions · grammar/JSON-Schema constrained generation · embeddings (+ packed float32 output) · RerankModel · reasoning split · typed exceptions · PHPT suite · CI matrix · PIE-compatible composer.json · tag-triggered binary release workflow · THIRD-PARTY-NOTICES + cargo about license manifest.

Next (v0.3+): streaming completions · KV-cache reuse via reusable Session objects · stop-string support · LoRA adapters · tool calling · Apple Metal default on macos-arm64.

See PLAN.md for the current planning doc and RELEASE.md for the cut-a-release flow.
