kreuzberg-dev/kreuzcrawl

High-performance web crawling engine

Package info

github.com/kreuzberg-dev/kreuzcrawl

Language: Rust

Type: php-ext

Ext name: ext-kreuzcrawl

pkg:composer/kreuzberg-dev/kreuzcrawl

Statistics

Installs: 37

Dependents: 0

Suggesters: 0

Stars: 115

Open Issues: 2

v0.3.0 2026-06-23 16:07 UTC

README

High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 14 languages — same engine, identical results across every runtime.

What and Why?

Kreuzcrawl is the crawling substrate: everything you need to scrape and crawl a site end-to-end from a single Rust core — HTML→Markdown, headless-Chrome fallback, robots/sitemap parsing, per-domain throttling, and an SSRF-safe policy — with identical results across 14 language bindings.
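
To make "per-domain throttling" concrete, here is a minimal sketch using only the Rust standard library. It is not Kreuzcrawl's implementation: `DomainThrottle` and `reserve` are hypothetical names invented for this illustration, and the fixed-interval policy is an assumption.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-domain throttle (illustration only, not the crate's API).
/// Each domain gets its own schedule, so a slow site never delays another.
pub struct DomainThrottle {
    min_interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainThrottle {
    pub fn new(min_interval: Duration) -> Self {
        Self {
            min_interval,
            next_allowed: HashMap::new(),
        }
    }

    /// Reserves the next request slot for `domain` and returns how long
    /// the caller should wait, measured from `now`, before sending it.
    pub fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        // The slot is the later of "now" and the domain's next free time.
        let slot = match self.next_allowed.get(domain) {
            Some(&t) if t > now => t,
            _ => now,
        };
        // Push the domain's next free time one interval past this slot.
        self.next_allowed
            .insert(domain.to_string(), slot + self.min_interval);
        slot.duration_since(now)
    }
}
```

A caller would sleep for the returned duration before issuing the request. Passing `now` explicitly keeps the logic deterministic and easy to test.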

Productization concerns (managed proxy pools, tuned WAF fingerprints, authenticated-session injection, scheduling, billing) live in kreuzberg-cloud, the reference operational implementation. Every extension point (Frontier, RateLimiter, CrawlStore, EventEmitter, ContentFilter, WafClassifier, …) is a trait you inject via CrawlEngineBuilder::with_<trait>(...).
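
The trait-injection pattern described above can be sketched as follows. Everything in this block is defined locally for illustration: the `keep` method, `SketchEngine`, `SketchEngineBuilder`, and `MinLength` are assumptions, and the real `ContentFilter` trait and `CrawlEngineBuilder` may look quite different. Only the `with_<trait>(...)` naming pattern comes from the text.

```rust
/// Hypothetical extension point; the real trait's methods may differ.
pub trait ContentFilter {
    fn keep(&self, text: &str) -> bool;
}

/// Default used when nothing is injected: keeps every page.
struct KeepAll;

impl ContentFilter for KeepAll {
    fn keep(&self, _text: &str) -> bool {
        true
    }
}

/// Caller-supplied implementation: drops pages shorter than N bytes.
pub struct MinLength(pub usize);

impl ContentFilter for MinLength {
    fn keep(&self, text: &str) -> bool {
        text.len() >= self.0
    }
}

/// Stand-in for an engine that owns its injected behavior.
pub struct SketchEngine {
    filter: Box<dyn ContentFilter>,
}

impl SketchEngine {
    /// Applies the injected filter to a batch of page texts.
    pub fn filter_pages<'a>(&self, pages: &[&'a str]) -> Vec<&'a str> {
        pages
            .iter()
            .copied()
            .filter(|p| self.filter.keep(*p))
            .collect()
    }
}

pub struct SketchEngineBuilder {
    filter: Box<dyn ContentFilter>,
}

impl SketchEngineBuilder {
    pub fn new() -> Self {
        Self {
            filter: Box::new(KeepAll),
        }
    }

    /// Follows the `with_<trait>(...)` naming pattern: swap in any
    /// implementation of the trait without touching the engine itself.
    pub fn with_content_filter(mut self, filter: impl ContentFilter + 'static) -> Self {
        self.filter = Box::new(filter);
        self
    }

    pub fn build(self) -> SketchEngine {
        SketchEngine {
            filter: self.filter,
        }
    }
}
```

With this shape, `SketchEngineBuilder::new().with_content_filter(MinLength(5)).build()` yields an engine whose filtering behavior is chosen by the caller, while the default build keeps every page.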

Features

Feature | Description
Structured extraction | Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, response headers
Markdown conversion | Clean Markdown with citations, document structure, and fit-content mode
Concurrent crawling | Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
14 language bindings | Rust, Python, Node.js, Ruby, Go, Java, Kotlin (Android), C#, PHP, Elixir, Dart, Swift, Zig, and WebAssembly
Smart filtering | BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, sitemap discovery
Browser rendering | Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
Batch & streaming | Scrape or crawl hundreds of URLs concurrently; real-time crawl events via async streams
SSRF-safe by default | Refuses loopback, private, link-local, and cloud-metadata addresses; opt out via env var or CrawlConfig
Auth & rate limiting | HTTP Basic, Bearer, and custom-header auth with cookie jars; per-domain request throttling
MCP server & REST API | Model Context Protocol integration for AI agents plus an HTTP server with OpenAPI spec
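
As an illustration of the "SSRF-safe by default" row, the sketch below shows one way to classify addresses as refused, using only `std::net`. It is not Kreuzcrawl's policy code: the function names are hypothetical, and the exact set of blocked ranges is an assumption based on the categories listed above.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Returns true when an address should be refused (hypothetical helper).
pub fn is_blocked(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        // IPv4-mapped addresses (::ffff:a.b.c.d) are judged as IPv4 so
        // they cannot be used to sneak past the IPv4 rules.
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => is_blocked_v4(v4),
            None => is_blocked_v6(v6),
        },
    }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()           // 127.0.0.0/8
        || ip.is_private()     // 10/8, 172.16/12, 192.168/16
        || ip.is_link_local()  // 169.254.0.0/16, covers 169.254.169.254
        || ip.is_unspecified() // 0.0.0.0
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()                   // ::1
        || ip.is_unspecified()         // ::
        || (first & 0xfe00) == 0xfc00  // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80  // link-local, fe80::/10
}

/// Convenience wrapper for IP literals; returns None for anything that
/// is not an IP address (for example a hostname that still needs DNS).
pub fn is_blocked_literal(host: &str) -> Option<bool> {
    host.parse::<IpAddr>().ok().map(is_blocked)
}
```

A check like this only helps if it runs against the addresses a hostname actually resolves to at connection time; validating the URL string alone leaves room for DNS-based bypasses.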

Supported Platforms

Precompiled binaries are available for Linux (x86_64/aarch64), macOS (ARM64), and Windows (x64) across every binding. See the platform support reference for the full matrix.

Star Kreuzcrawl on GitHub

⭐ Star this repo to show your support — it helps others discover Kreuzcrawl.

Quick Start

Language Packages

Python
pip install kreuzcrawl

See Python README for full documentation.

Node.js
npm install @kreuzberg/kreuzcrawl

See Node.js README for full documentation.

Rust
cargo add kreuzcrawl

See Rust README for full documentation.

Go
go get github.com/kreuzberg-dev/kreuzcrawl/packages/go

See Go README for full documentation.

Java

Available on Maven Central as dev.kreuzberg.kreuzcrawl:kreuzcrawl. See Java README for the dependency snippet and current version.

C#
dotnet add package Kreuzcrawl

See C# README for full documentation.

Ruby
gem install kreuzcrawl

See Ruby README for full documentation.

PHP
composer require kreuzberg-dev/kreuzcrawl

See PHP README for full documentation.

Elixir

Add {:kreuzcrawl, "~> 0.3"} to your mix.exs dependencies. See Elixir README for full documentation.

Dart / Flutter
dart pub add kreuzcrawl

See Dart README for full documentation.

Kotlin (Android)

Available on Maven Central as dev.kreuzberg.kreuzcrawl.android:kreuzcrawl-android. See Kotlin README for the dependency snippet and current version.

Swift

Add via Swift Package Manager. See Swift README for full documentation.

Zig

See Zig README for installation and usage.

WebAssembly
npm install @kreuzberg/kreuzcrawl-wasm

See WebAssembly README for full documentation.

C/C++ (FFI)

A C header and shared library are available from GitHub Releases. See FFI crate for full documentation.

CLI
cargo install kreuzcrawl-cli
brew install kreuzberg-dev/tap/kreuzcrawl

See CLI README for full documentation.

AI Coding Assistants

Install the Kreuzcrawl plugin from the kreuzberg-dev/plugins marketplace. It ships the Kreuzcrawl agent skills (site crawling, HTML→Markdown scraping, headless-Chrome fallback) plus the kreuzcrawl MCP server, and works with every major coding agent. Instructions for each harness follow.

Claude Code
/plugin marketplace add kreuzberg-dev/plugins
/plugin install kreuzcrawl@kreuzberg

Codex CLI
/plugins add https://github.com/kreuzberg-dev/plugins

Then search for kreuzcrawl and select Install Plugin.

Cursor

Settings → Plugins → Add from URL → https://github.com/kreuzberg-dev/plugins, then select kreuzcrawl.

Gemini CLI
gemini extensions install https://github.com/kreuzberg-dev/plugins

Factory Droid
droid plugin marketplace add https://github.com/kreuzberg-dev/plugins
droid plugin install kreuzcrawl@kreuzberg

GitHub Copilot CLI
copilot plugin marketplace add https://github.com/kreuzberg-dev/plugins
copilot plugin install kreuzcrawl@kreuzberg

opencode

Add the package to opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["@kreuzberg/opencode-kreuzcrawl"]
}

Documentation

Full guides, per-language API references, the substrate/operational model, antibot strategy, and observability live at docs.kreuzcrawl.kreuzberg.dev.

Contributing

Contributions are welcome! See our Contributing Guide.

Part of Kreuzberg.dev

  • Kreuzberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
  • Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
  • kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
  • html-to-markdown — fast, lossless HTML→Markdown engine.
  • liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
  • tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
  • alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.

License

Elastic License 2.0