kreuzberg-dev / kreuzcrawl
High-performance web crawling engine
Package info
github.com/kreuzberg-dev/kreuzcrawl
Language:Rust
Type:php-ext
Ext name:ext-kreuzcrawl
pkg:composer/kreuzberg-dev/kreuzcrawl
Requires
- php: >=8.2
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.95
- phpstan/phpstan: ^2.1
- phpunit/phpunit: ^13.1
This package is auto-updated.
Last update: 2026-06-23 18:00:43 UTC
README
High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 14 languages — same engine, identical results across every runtime.
What and Why?
Kreuzcrawl is the crawling substrate: everything you need to scrape and crawl a site end-to-end from a single Rust core — HTML→Markdown, headless-Chrome fallback, robots/sitemap parsing, per-domain throttling, and an SSRF-safe policy — with identical results across 14 language bindings.
Productization concerns (managed proxy pools, tuned WAF fingerprints, authenticated-session injection, scheduling, billing) live in kreuzberg-cloud, the reference operational implementation. Every extension point (Frontier, RateLimiter, CrawlStore, EventEmitter, ContentFilter, WafClassifier, …) is a trait you inject via CrawlEngineBuilder::with_<trait>(...).
Features
| Feature | Description |
|---|---|
| Structured extraction | Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, response headers |
| Markdown conversion | Clean Markdown with citations, document structure, and fit-content mode |
| Concurrent crawling | Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency |
| 14 language bindings | Rust, Python, Node.js, Ruby, Go, Java, Kotlin (Android), C#, PHP, Elixir, Dart, Swift, Zig, and WebAssembly |
| Smart filtering | BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, sitemap discovery |
| Browser rendering | Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass |
| Batch & streaming | Scrape or crawl hundreds of URLs concurrently; real-time crawl events via async streams |
| SSRF-safe by default | Refuses loopback, private, link-local, and cloud-metadata addresses; opt out via env var or CrawlConfig |
| Auth & rate limiting | HTTP Basic, Bearer, and custom-header auth with cookie jars; per-domain request throttling |
| MCP server & REST API | Model Context Protocol integration for AI agents plus an HTTP server with OpenAPI spec |
Supported Platforms
Precompiled binaries for Linux (x86_64/aarch64), macOS (ARM64), and Windows (x64) across every binding. See the platform support reference for the full matrix.
⭐ Star this repo to show your support — it helps others discover Kreuzcrawl.
Quick Start
Language Packages
Java
Available on Maven Central as dev.kreuzberg.kreuzcrawl:kreuzcrawl. See Java README for the dependency snippet and current version.
Elixir
Add {:kreuzcrawl, "~> 0.3"} to your mix.exs dependencies. See Elixir README for full documentation.
Kotlin (Android)
Available on Maven Central as dev.kreuzberg.kreuzcrawl.android:kreuzcrawl-android. See Kotlin README for the dependency snippet and current version.
Swift
Add via Swift Package Manager. See Swift README for full documentation.
Zig
See Zig README for installation and usage.
C/C++ (FFI)
C header + shared library from GitHub Releases. See FFI crate for full documentation.
CLI
cargo install kreuzcrawl-cli
brew install kreuzberg-dev/tap/kreuzcrawl
See CLI README for full documentation.
AI Coding Assistants
Install the Kreuzcrawl plugin from the kreuzberg-dev/plugins marketplace. It ships the Kreuzcrawl agent skills (site crawling, HTML→Markdown scraping, headless-Chrome fallback) plus the kreuzcrawl MCP server, and works with every major coding agent — expand your harness below.
Claude Code
/plugin marketplace add kreuzberg-dev/plugins
/plugin install kreuzcrawl@kreuzberg
Codex CLI
/plugins add https://github.com/kreuzberg-dev/plugins
Then search for kreuzcrawl and select Install Plugin.
Cursor
Settings → Plugins → Add from URL → https://github.com/kreuzberg-dev/plugins, then select kreuzcrawl.
Gemini CLI
gemini extensions install https://github.com/kreuzberg-dev/plugins
Factory Droid
droid plugin marketplace add https://github.com/kreuzberg-dev/plugins
droid plugin install kreuzcrawl@kreuzberg
GitHub Copilot CLI
copilot plugin marketplace add https://github.com/kreuzberg-dev/plugins
copilot plugin install kreuzcrawl@kreuzberg
opencode
Add the package to opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"plugin": ["@kreuzberg/opencode-kreuzcrawl"]
}
Documentation
Full guides, per-language API references, the substrate/operational model, antibot strategy, and observability live at docs.kreuzcrawl.kreuzberg.dev.
Contributing
Contributions are welcome! See our Contributing Guide.
Part of Kreuzberg.dev
- Kreuzberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
- Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
- kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- html-to-markdown — fast, lossless HTML→Markdown engine.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
License
Links
- Documentation
- GitHub Repository
- Issue Tracker
- Changelog
- Discord — community, roadmap, announcements.