tag1 / scolta-php
AI-powered search with Pagefind — PHP language binding
Requires
- php: >=8.1
- guzzlehttp/guzzle: ^7.0
- psr/log: ^3.0
- wamania/php-stemmer: ^3.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.0
- phpunit/phpunit: ^10.0|^11.0
Suggests
- ext-intl: Improves Unicode diacritic normalization quality. Falls back to strtr() mapping without it.
This package is auto-updated.
Last update: 2026-05-20 22:23:05 UTC
README
PHP library that indexes content into Pagefind-compatible search indexes, plus the shared orchestration, memory-budget management, and AI client used by Scolta's CMS adapters.
Status
Scolta 1.0 — the API documented here is stable. Breaking changes follow semantic versioning: no removal or signature change without a major version bump and a deprecation cycle. File bugs at the repo issue tracker.
What Is Scolta?
Scolta is a scoring, ranking, and AI layer built on Pagefind. Pagefind is the search engine: it builds a static inverted index at publish time, runs a browser-side WASM search engine, produces word-position data, and generates highlighted excerpts. Scolta takes Pagefind's result set and re-ranks it with configurable boosts — title match weight, content match weight, recency decay curves, and phrase-proximity multipliers. No search server required. Queries resolve in the visitor's browser against the pre-built static index.
This package is the PHP foundation for all three CMS adapters. It handles the parts that are the same regardless of platform: indexing content to Pagefind-compatible HTML files, AI provider communication, configuration management, memory budgeting, and the shared browser assets (scolta.js, scolta.css, and the pre-built WASM module). The CMS adapters (scolta-drupal, scolta-laravel, scolta-wp) depend on this package and add only their platform-specific concerns.
The LLM tier — query expansion, result summarization, follow-up questions — is optional. When enabled, it sends the query text and selected result excerpts to a configured LLM provider. The base search tier shares nothing with any third party; it runs entirely in the visitor's browser.
Running Example
The examples in this README and the other Scolta repos use a recipe catalog as the concrete data set. Recipes are a good showcase because recipe vocabulary has cross-dialect mismatches that basic keyword search handles poorly:
- A search for `aubergine parmesan` should surface Eggplant Parmigiana.
- A search for `chinese noodle soup` should surface Lanzhou Beef Noodles, Wonton Soup, and Dan Dan Noodles.
- A search for `gluten free pasta` should surface Zucchini Spaghetti with Pesto and Rice Noodle Stir-Fry.
- A search for `quick dinner under 30 min` should surface Pad Kra Pao, Dan Dan Noodles, Steak Frites, and others.
The recipe fixture lives at tests/fixtures/recipes/ — 20 HTML files in Pagefind-compatible format, one per recipe.
Here is how to index the recipe catalog outside any CMS, using the IndexBuildOrchestrator directly:
```php
<?php

require_once __DIR__ . '/vendor/autoload.php';

use Tag1\Scolta\Export\ContentItem;
use Tag1\Scolta\Index\IndexBuildOrchestrator;
use Tag1\Scolta\Index\BuildIntent;
use Tag1\Scolta\Config\MemoryBudget;

// Load the 20 recipe HTML files from the fixture directory
$fixtures = glob(__DIR__ . '/tests/fixtures/recipes/*.html');
$items = [];

foreach ($fixtures as $file) {
    $dom = new DOMDocument();
    @$dom->loadHTMLFile($file);

    $body = $dom->getElementById(basename($file, '.html'));
    $id = pathinfo($file, PATHINFO_FILENAME);
    $title = $dom->getElementsByTagName('title')[0]->textContent;

    $items[] = new ContentItem(
        id: $id,
        title: $title,
        bodyHtml: $dom->saveHTML($body),
        url: '/recipes/' . $id,
        date: '2024-03-01',
        siteName: 'Recipe Catalog',
    );
}

// Run the build using the conservative memory profile (96 MB internal budget)
$orchestrator = new IndexBuildOrchestrator(
    stateDir: '/tmp/scolta-state',
    outputDir: '/var/www/html/pagefind',
    language: 'en',
);

$result = $orchestrator->build(
    intent: BuildIntent::fresh(count($items), MemoryBudget::conservative()),
    pages: $items,
);

printf("Indexed %d recipes in %.1fs\n", $result->pageCount, $result->elapsedSeconds);
// Indexed 20 recipes in 0.3s
```
After indexing, the /var/www/html/pagefind/ directory contains a Pagefind-compatible static index. Point a browser at it and load scolta.js to get a working search UI with vocabulary-mismatch handling.
Installation
```shell
composer require tag1/scolta-php:^1.0
```
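Requirements: PHP 8.1+. ext-intl is recommended for Unicode normalization; Composer lists it as a suggestion rather than a hard requirement, and without it the library falls back to a strtr() mapping.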
Platform adapters install this package automatically. Install it directly only when building a custom adapter or a non-CMS integration.
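The ext-intl fallback mentioned in the package metadata can be pictured with a small sketch. `foldDiacritics()` is a hypothetical helper written for this README, not a function exported by scolta-php, and its character map is deliberately incomplete; the library's own mapping may differ.

```php
<?php

/**
 * Fold common Latin diacritics to their base letters.
 * Hypothetical helper for illustration; not part of the Scolta API.
 */
function foldDiacritics(string $text): string
{
    if (class_exists(\Normalizer::class)) {
        // ext-intl path: decompose to NFD, then drop the combining marks.
        $decomposed = \Normalizer::normalize($text, \Normalizer::FORM_D);
        if ($decomposed !== false) {
            return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $text;
        }
    }

    // Fallback path: a hand-written strtr() map (incomplete on purpose).
    return strtr($text, [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a', 'ã' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o', 'õ' => 'o',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'ñ' => 'n', 'ç' => 'c',
    ]);
}

echo foldDiacritics('Crème brûlée'), "\n"; // Creme brulee
```

The intl path covers any character with a canonical decomposition, while a static map only covers what it lists, which is why the metadata describes ext-intl as improving normalization quality.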
Configuration and Quickstart
All Scolta configuration flows through Tag1\Scolta\Config\ScoltaConfig. Construct it with ScoltaConfig::fromArray():
```php
use Tag1\Scolta\Config\ScoltaConfig;

$config = ScoltaConfig::fromArray([
    // AI provider (optional; omit for base search only)
    'ai_provider' => 'anthropic',
    'ai_api_key' => getenv('SCOLTA_API_KEY'),
    'ai_model' => 'claude-sonnet-4-5-20250929',
    'ai_expand_query' => true,
    'ai_summarize' => true,

    // Scoring, tuned for a recipe catalog (no recency, title precision)
    'scoring' => [
        'title_match_boost' => 1.5,
        'title_all_terms_multiplier' => 2.0,
        'content_match_boost' => 0.4,
        'recency_strategy' => 'none',
        'language' => 'en',
    ],

    // Site identity (used in AI prompts)
    'site_name' => 'Recipe Catalog',
    'site_description' => 'a collection of 20 international recipes',
]);
```
For the full list of config keys and their defaults, see docs/CONFIG_REFERENCE.md.
What Scolta Is Built For
Scolta is designed for content search on publishing platforms: pages, posts, documentation, product catalogs, and other human-authored content indexed at build time. This package is the PHP foundation shared by the Drupal, WordPress, and Laravel adapters — the platforms behind enterprise content operations, government and university portals, media publishing, and product-driven businesses.
The static-index architecture eliminates the search server. No Solr, no Elasticsearch, no hosted SaaS subscription to operate or pay for. Scolta replaces those for content sites where the search use case is full-text relevance, recency, and phrase matching. Teams on managed hosting (WP Engine, Kinsta, Pantheon, Flywheel) where exec() is disabled will find the PHP indexer runs there without any configuration change.
Memory and Scale
Memory profiles control Scolta's internal allocation budget — the memory Scolta itself adds on top of what the PHP process already uses. Total process RSS is higher: it includes the PHP runtime baseline for your platform plus the Scolta budget plus ~15 MB I/O overhead.
Typical platform baselines (before any indexing work):
| Platform | Baseline RSS |
|---|---|
| Laravel CLI | ~60 MB |
| WordPress | ~80 MB |
| Drupal | ~130 MB |
The default profile is conservative (96 MB internal budget). On WordPress, expect total peak RSS around 175 MB; on Drupal, around 240 MB. Scolta never silently upgrades to a larger profile. To opt in to a larger profile:
```php
use Tag1\Scolta\Config\MemoryBudget;
use Tag1\Scolta\Config\MemoryBudgetSuggestion;

// Auto-detect and suggest a profile based on the current PHP memory_limit
$suggestion = MemoryBudgetSuggestion::suggest();
// $suggestion->profile is 'conservative', 'balanced', or 'aggressive'
// $suggestion->warning is non-empty if the limit is tight

// Or specify directly
$budget = MemoryBudget::balanced();   // internal budget: 384 MB
$budget = MemoryBudget::aggressive(); // internal budget: 1 GB

// Or pass a budget in bytes
$budget = MemoryBudget::fromBytes(256 * 1024 * 1024);
```
The trade-off: a larger budget means fewer, larger index chunks and faster builds. The conservative profile is always the default and always safe to use.
Tested ceiling at the conservative profile: 50,000 pages. Higher counts likely work but have not been certified yet.
You can also pass the profile string at the CLI via --memory-budget=balanced if the CMS adapter supports the flag.
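The sizing rule from the start of this section (platform baseline plus Scolta budget plus roughly 15 MB of I/O overhead) can be sketched as a quick calculation. `estimatePeakRssMb()` is a hypothetical helper, and the inputs are the approximate figures quoted above. The result is an upper bound that assumes the whole budget is in use, so measured peaks can come in lower.

```php
<?php

/**
 * Rough upper bound on peak RSS in MB: baseline + Scolta budget + I/O overhead.
 * Hypothetical helper for illustration; not part of the Scolta API.
 */
function estimatePeakRssMb(int $baselineMb, int $budgetMb, int $ioOverheadMb = 15): int
{
    return $baselineMb + $budgetMb + $ioOverheadMb;
}

// Approximate platform baselines from the table above.
$baselines = ['Laravel CLI' => 60, 'WordPress' => 80, 'Drupal' => 130];
$conservativeBudgetMb = 96;

foreach ($baselines as $platform => $baselineMb) {
    printf(
        "%-12s up to ~%d MB\n",
        $platform,
        estimatePeakRssMb($baselineMb, $conservativeBudgetMb)
    );
}
```

Comparing the estimate with the PHP `memory_limit` before a build is one way to decide whether a larger profile is worth opting in to.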
AI Features and Privacy
Scolta's AI tier is optional. When enabled:
- The LLM receives: the query text, and the titles and excerpts of the top N results (default: 5, configurable via `ai_summary_top_n`).
- The LLM does not receive: the full index contents, full page text, user session data, or visitor identity.
- Which provider receives the query data depends on your `ai_provider` setting: `anthropic`, `openai`, or a self-hosted endpoint via `ai_base_url`.
The base search tier — Pagefind index lookup and Scolta WASM scoring — runs entirely in the visitor's browser with no server-side involvement beyond serving the static index files.
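As a rough illustration of the data boundary described in the list above, the following sketch keeps only titles and excerpts from the top N results. `selectSummaryContext()` and the result array shape are hypothetical, and treating the character cap as a total across all excerpts is an assumption; Scolta's actual payload construction may differ.

```php
<?php

/**
 * Pick the fields of the top-N results that would be shared with an LLM.
 * Hypothetical helper and result shape; not the Scolta implementation.
 *
 * @param list<array<string, string>> $results each with 'title' and 'excerpt'
 * @return list<array{title: string, excerpt: string}>
 */
function selectSummaryContext(array $results, int $topN = 5, int $maxChars = 2000): array
{
    $context = [];
    $remaining = $maxChars;

    foreach (array_slice($results, 0, $topN) as $result) {
        if ($remaining <= 0) {
            break;
        }
        // Assumption: the cap applies to the excerpts in total.
        // Byte-based for brevity; mb_substr() would be safer for multibyte text.
        $excerpt = substr($result['excerpt'], 0, $remaining);
        $remaining -= strlen($excerpt);

        // Only title and excerpt are kept; URLs and any other fields are dropped.
        $context[] = ['title' => $result['title'], 'excerpt' => $excerpt];
    }

    return $context;
}

$results = [
    ['title' => 'Eggplant Parmigiana', 'excerpt' => 'Layers of fried eggplant.', 'url' => '/recipes/eggplant-parmigiana'],
    ['title' => 'Wonton Soup', 'excerpt' => 'Pork wontons in clear broth.', 'url' => '/recipes/wonton-soup'],
];

print_r(selectSummaryContext($results, 1));
```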
Optional Upgrades
Indexer options
Both indexers produce the same Pagefind-compatible index. The search experience is identical either way. Choose based on your hosting constraints.
PHP indexer (the default): runs everywhere, no binary required. Around 3–4 seconds per 1,000 pages. Supports 14 languages via Snowball stemming (Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish).
Pagefind binary indexer: 5–10× faster. Requires Node.js ≥ 18 or a direct binary download. Supports 33+ languages. Better for large sites or environments where the binary is installable.
On managed hosting (WP Engine, Kinsta, Flywheel, Pantheon), exec() is disabled. The PHP indexer runs there automatically with no configuration change.
To install the binary:
```shell
# Download via the CLI command (no Node.js required):
wp scolta download-pagefind            # WordPress
drush scolta:download-pagefind         # Drupal
php artisan scolta:download-pagefind   # Laravel

# Or install via npm (Node.js ≥ 18 required):
npm install -g pagefind
```
indexer: auto (the default) uses the binary when available and falls back to PHP automatically.
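The `auto` behaviour can be pictured as a small decision function. This is a sketch under assumptions: `resolveIndexer()` and `execIsAllowed()` are hypothetical names, and the explicit `php` and `binary` modes are assumed for illustration, since only `auto` is documented here.

```php
<?php

/**
 * Decide which indexer to use. Hypothetical helper; not the Scolta resolver.
 *
 * @param string      $mode        'auto', or the assumed explicit modes 'php' / 'binary'
 * @param string|null $binaryPath  path to a Pagefind binary, or null if none was found
 * @param bool        $execAllowed whether the host permits exec()
 */
function resolveIndexer(string $mode, ?string $binaryPath, bool $execAllowed): string
{
    if ($mode === 'php') {
        return 'php';
    }

    $binaryUsable = $execAllowed && $binaryPath !== null;

    if ($mode === 'binary') {
        if (!$binaryUsable) {
            throw new RuntimeException('Pagefind binary requested but not usable on this host.');
        }
        return 'binary';
    }

    // 'auto': prefer the binary, fall back to the PHP indexer.
    return $binaryUsable ? 'binary' : 'php';
}

/** Check that exec() exists and is not listed in disable_functions. */
function execIsAllowed(): bool
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));

    return function_exists('exec') && !in_array('exec', $disabled, true);
}

echo resolveIndexer('auto', null, execIsAllowed()), "\n"; // php
```

On managed hosts where exec() is disabled, the third argument is false and the function returns `php` regardless of whether a binary is present.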
Language support for the PHP indexer
For languages outside the 14 supported by Snowball, search works but inflected forms ("running", "ran") will not match a stemmed base ("run"). CJK languages (Chinese, Japanese, Korean) use character-level tokenization and do not require stemming. For full 33+ language stemming coverage, use the Pagefind binary indexer.
Debugging
"ext-intl not found"
```shell
# Debian/Ubuntu
sudo apt-get install php8.1-intl

# macOS (Homebrew)
brew install php
```
Verify: php -m | grep intl
"PhpIndexer produces empty output"
Verify ext-intl is loaded and that the ContentItem objects passed to the indexer have non-empty bodyHtml. The indexer skips items where the cleaned text is shorter than 50 characters.
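A quick pre-flight check can flag items that are likely to be skipped. This is only an approximation: `cleanedTextLength()` is a hypothetical helper that strips tags and collapses whitespace, whereas the library's HtmlCleaner may remove more (or less) before measuring.

```php
<?php

/**
 * Approximate the length of the text left after stripping markup.
 * Hypothetical check for illustration; HtmlCleaner may clean content differently.
 */
function cleanedTextLength(string $bodyHtml): int
{
    $text = html_entity_decode(strip_tags($bodyHtml), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? '');

    // Count UTF-8 characters without requiring ext-mbstring.
    return (int) preg_match_all('/./su', $text);
}

$minLength = 50; // threshold quoted in the paragraph above

$samples = [
    'wonton-soup' => '<p>Simmer the broth, fold the wontons, and poach them for four minutes before serving.</p>',
    'stub' => '<p>Coming soon.</p>',
];

foreach ($samples as $id => $html) {
    $length = cleanedTextLength($html);
    printf("%s: %d chars, %s\n", $id, $length, $length < $minLength ? 'likely skipped' : 'ok');
}
```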
"AI calls failing"
- Confirm the API key: check the `SCOLTA_API_KEY` env var or the platform-specific constant.
- Check the model identifier; model names change with provider releases. Default: `claude-sonnet-4-5-20250929`.
- Enable request logging: set `SCOLTA_DEBUG=1` to log raw request/response bodies via Guzzle.
"Scoring results look wrong"
The browser-side WASM scorer (scolta-core) runs via wasm-bindgen. If results appear unscored or identically ranked, confirm both pagefind.js and scolta_core_bg.wasm are loading without 404 errors in the browser console.
Configuration Reference
All Scolta configuration flows through Tag1\Scolta\Config\ScoltaConfig. Platform adapters map their native config systems into this object via ScoltaConfig::fromArray(), which accepts snake_case keys.
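The tables below pair each camelCase property with its snake_case key. The naming relationship can be sketched in one line; `snakeToCamel()` is a hypothetical helper shown only to make the convention concrete, and ScoltaConfig may well map keys explicitly rather than by string conversion.

```php
<?php

/**
 * Convert a snake_case config key to the matching camelCase property name.
 * Hypothetical helper for illustration; not part of the Scolta API.
 */
function snakeToCamel(string $key): string
{
    return lcfirst(str_replace('_', '', ucwords($key, '_')));
}

echo snakeToCamel('ai_summary_top_n'), "\n";     // aiSummaryTopN
echo snakeToCamel('max_pagefind_results'), "\n"; // maxPagefindResults
```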
AI Provider
| Property | snake_case key | Type | Default | Description |
|---|---|---|---|---|
| `aiProvider` | `ai_provider` | string | `anthropic` | AI provider (`anthropic` or `openai`) |
| `aiApiKey` | `ai_api_key` | string | `''` | API key for the AI provider |
| `aiModel` | `ai_model` | string | `claude-sonnet-4-5-20250929` | Model identifier |
| `aiBaseUrl` | `ai_base_url` | string | `''` | Custom API base URL (empty = provider default) |
| `aiExpandQuery` | `ai_expand_query` | bool | `true` | Enable AI query expansion |
| `aiSummarize` | `ai_summarize` | bool | `true` | Enable AI result summarization |
| `aiSummaryTopN` | `ai_summary_top_n` | int | `5` | Number of top results sent to AI for summarization |
| `aiSummaryMaxChars` | `ai_summary_max_chars` | int | `2000` | Maximum characters of content sent to AI for summarization |
| `aiLanguages` | `ai_languages` | array | `['en']` | Supported languages for AI responses. With multiple languages, the AI responds in the user's query language if it matches; otherwise it falls back to the primary (first) language. |
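The `ai_languages` fallback rule can be sketched as follows. `chooseResponseLanguage()` is a hypothetical helper; how the query language is detected is outside the scope of this sketch and is simply passed in.

```php
<?php

/**
 * Pick the language for an AI response: the query language when it is
 * supported, otherwise the primary (first) configured language.
 * Hypothetical helper for illustration; not part of the Scolta API.
 *
 * @param string[] $aiLanguages configured languages, primary first
 */
function chooseResponseLanguage(string $queryLanguage, array $aiLanguages = ['en']): string
{
    if ($aiLanguages === []) {
        throw new InvalidArgumentException('At least one language must be configured.');
    }

    return in_array($queryLanguage, $aiLanguages, true)
        ? $queryLanguage
        : array_values($aiLanguages)[0];
}

echo chooseResponseLanguage('fr', ['en', 'fr']), "\n"; // fr
echo chooseResponseLanguage('de', ['en', 'fr']), "\n"; // en
```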
Scoring: Recency
| Property | snake_case key | Type | Default | Description |
|---|---|---|---|---|
| `recencyStrategy` | `recency_strategy` | string | `exponential` | Decay function: `exponential`, `linear`, `step`, `none`, or `custom` (piecewise-linear) |
| `recencyCurve` | `recency_curve` | array | `[]` | Control points for `custom` strategy: `[[days, boost], …]` sorted ascending |
| `recencyBoostMax` | `recency_boost_max` | float | `0.5` | Maximum positive boost for recent content |
| `recencyHalfLifeDays` | `recency_half_life_days` | int | `365` | Half-life for recency decay (days) |
| `recencyPenaltyAfterDays` | `recency_penalty_after_days` | int | `1825` | Age threshold before penalty applies (~5 years) |
| `recencyMaxPenalty` | `recency_max_penalty` | float | `0.3` | Maximum penalty for old content |
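To make the `exponential` and `custom` strategies concrete, here is a minimal sketch of a half-life decay and a piecewise-linear curve. Both functions are hypothetical and only illustrate the standard techniques the option names suggest; the real scoring runs in the browser WASM module, and this sketch does not model the old-content penalty or how the boost is combined with the base score.

```php
<?php

/** Exponential decay: the full boost at age 0, half of it after one half-life. */
function exponentialRecencyBoost(float $ageDays, float $boostMax = 0.5, float $halfLifeDays = 365.0): float
{
    return $boostMax * (0.5 ** (max(0.0, $ageDays) / $halfLifeDays));
}

/**
 * Piecewise-linear interpolation over [[days, boost], ...] control points
 * sorted by ascending days. Ages outside the range clamp to the end points.
 *
 * @param list<array{0: int|float, 1: int|float}> $curve
 */
function customRecencyBoost(float $ageDays, array $curve): float
{
    if ($curve === []) {
        return 0.0;
    }

    if ($ageDays <= $curve[0][0]) {
        return (float) $curve[0][1];
    }

    for ($i = 1, $n = count($curve); $i < $n; $i++) {
        [$x0, $y0] = $curve[$i - 1];
        [$x1, $y1] = $curve[$i];

        if ($ageDays <= $x1) {
            // Here $x0 < $ageDays <= $x1, so the divisor is never zero.
            $t = ($ageDays - $x0) / ($x1 - $x0);
            return $y0 + $t * ($y1 - $y0);
        }
    }

    return (float) $curve[count($curve) - 1][1];
}

printf("%.3f\n", exponentialRecencyBoost(365.0));                  // 0.250
printf("%.3f\n", customRecencyBoost(15.0, [[0, 0.5], [30, 0.1]])); // 0.300
```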
Scoring: Title/Content Match
| Property | snake_case key | Type | Default | Description |
|---|---|---|---|---|
| `titleMatchBoost` | `title_match_boost` | float | `1.0` | Boost for title keyword matches |
| `titleAllTermsMultiplier` | `title_all_terms_multiplier` | float | `1.5` | Multiplier when all search terms appear in title |
| `contentMatchBoost` | `content_match_boost` | float | `0.4` | Boost for content/excerpt keyword matches |
| `expandPrimaryWeight` | `expand_primary_weight` | float | `0.7` | Weight given to original query results vs expanded results during merge |
Scoring: Language
| Property | snake_case key | Type | Default | Description |
|---|---|---|---|---|
| `language` | `language` | string | `en` | ISO 639-1 language code for stop word filtering. 30 languages supported; unknown codes apply no stop word filtering. |
| `customStopWords` | `custom_stop_words` | array | `[]` | Additional stop words beyond the language's built-in list |
Display
| Property | snake_case key | Type | Default | Description |
|---|---|---|---|---|
| `excerptLength` | `excerpt_length` | int | `300` | Maximum excerpt length in characters |
| `resultsPerPage` | `results_per_page` | int | `10` | Results shown per page |
| `maxPagefindResults` | `max_pagefind_results` | int | `50` | Maximum results fetched from Pagefind |
Site Identity
| Property | snake_case key | Type | Default | Description |
|---|---|---|---|---|
| `siteName` | `site_name` | string | `''` | Site name used in AI prompts |
| `siteDescription` | `site_description` | string | `website` | Site description used in AI prompts |
| `searchPagePath` | `search_page_path` | string | `/search` | Path to the search page |
| `pagefindIndexPath` | `pagefind_index_path` | string | `/pagefind` | URL path to the Pagefind index directory |
Caching
| Property | snake_case key | Type | Default | Description |
|---|---|---|---|---|
| `cacheTtl` | `cache_ttl` | int | `2592000` | Cache TTL in seconds (default: 30 days) |
| `maxFollowUps` | `max_follow_ups` | int | `3` | Maximum follow-up questions per session |
Prompts
| Property | snake_case key | Type | Default | Description |
|---|---|---|---|---|
| `promptExpandQuery` | `prompt_expand_query` | string | `''` | Custom prompt for query expansion (empty = use DefaultPrompts) |
| `promptSummarize` | `prompt_summarize` | string | `''` | Custom prompt for summarization (empty = use DefaultPrompts) |
| `promptFollowUp` | `prompt_follow_up` | string | `''` | Custom prompt for follow-up conversations (empty = use DefaultPrompts) |
For per-platform key mapping (e.g., Drupal scoring.recency_boost_max vs. WordPress recency_boost_max vs. Laravel scoring.recency_boost_max), see docs/CONFIG_REFERENCE.md.
Architecture
Platform Adapters scolta-php (this package) scolta-core (browser WASM)
(Drupal / WP / Laravel)
ContentGatherer ─────────> ContentExporter ──────────> HtmlCleaner
CLI build command ────────> IndexBuildOrchestrator PagefindHtmlBuilder
AiService ───────────────> AiClient
SettingsForm ────────────> ScoltaConfig
SearchPage ──────────────> DefaultPrompts Scoring runs in browser
CacheDriver ─────────────> CacheDriverInterface via scolta.js + WASM
What lives here:
- `ScoltaConfig`: platform-agnostic configuration with scoring defaults
- `AiClient`: provider-agnostic HTTP client for Anthropic and OpenAI APIs
- `AiEndpointHandler`: shared expand / summarize / follow-up logic
- `ContentExporter`: exports content items to Pagefind-compatible HTML files
- `IndexBuildOrchestrator`: single authoritative chunk-loop entry point for all adapters
- `MemoryBudget` / `MemoryBudgetSuggestion`: memory profile management
- `PhpIndexer`: pure PHP indexer producing Pagefind-compatible index files
- `HtmlCleaner`: HTML cleaning for content extraction
- `DefaultPrompts`: prompt templates with variable resolution (pure PHP, no WASM)
- `PagefindBinary`: binary resolver and downloader
- Shared assets: `scolta.js`, `scolta.css`, browser WASM
Scoring runs entirely in the browser via the WASM module loaded by scolta.js. The PHP server handles content indexing, AI API proxying, and configuration only.
Testing
```shell
composer install
./vendor/bin/phpunit
```
Credits
Scolta is built on Pagefind by CloudCannon. Without Pagefind, Scolta has no search to score — the index format, WASM search engine, word-position data, and excerpt generation are all Pagefind's. Scolta's contribution is the layer that sits on top: configurable scoring, multi-adapter ranking parity, AI features, and platform glue.
License
MIT
Related Packages
- scolta-core: Rust/WASM scoring, ranking, and AI layer that runs in the browser.
- scolta-drupal: Drupal 10/11 Search API backend with Drush commands, admin settings form, and a search block.
- scolta-laravel: Laravel 11/12/13 package with Artisan commands, a `Searchable` trait for Eloquent models, and a Blade search component.
- scolta-wp: WordPress 6.x plugin with WP-CLI commands, Settings API page, and a `[scolta_search]` shortcode.