anassrojea / laracrawler
Laravel sitemap generator and crawler package for SEO optimization. Supports multilingual sites, image/video indexing, link validation, priority scoring, and advanced crawling automation.
pkg:composer/anassrojea/laracrawler
Requires
- php: ^8.0
- guzzlehttp/guzzle: ^7.0
- illuminate/support: ^10.0|^11.0|^12.0
- symfony/css-selector: ^5.0|^6.0|^7.0
- symfony/dom-crawler: ^5.0|^6.0|^7.0
README
A powerful Laravel sitemap generator with crawling, validation, multilingual support, priority auto-scoring, indexability audit, and more.
Optimized for Google SEO best practices.
✨ Features
- Recursive crawling with depth control
- URL normalization (HTTPS, trailing slashes, lowercase, strip queries/anchors)
- Exclusion rules for URLs and assets (regex, extensions, substrings)
- Multilingual alternates (hreflang) with validation
- Image sitemap enhancements
  - Extract <img> + <picture> sources
  - Add <image:title> and <image:caption> from alt/title
- Video sitemap enhancements
  - Extract <video>, <source>, and <iframe> (YouTube, Vimeo)
  - Add <video:title> and <video:description> (defaults configurable)
- Priority auto-scoring
  - Based on crawl depth, internal link popularity, freshness
  - Supports per-page priority_boost
- Flexible lastmod strategies
  - now → always current time
  - file → file modification time
  - db → fetch from database column
  - callback → resolve dynamically via Closure/service
- Indexability audit
  - Detects noindex in headers (X-Robots-Tag) or meta tags
  - Excludes such pages and logs them into sitemap-errors.xml
- Link validation
  - Detects broken or soft-404 links
  - Excludes them and logs into sitemap-errors.xml
- Split & index
  - Auto-splits large sitemaps (50k URLs or 50MB limit)
  - Generates sitemap-index.xml
- Queue support for async crawling in Laravel jobs
- Auto-ping search engines (Google, Bing, Yandex, Baidu)
- Configurable HTTP client (timeouts, SSL verify, User-Agent)
⚙️ Installation
composer require anassrojea/laracrawler
Publish config:
php artisan vendor:publish --tag=laracrawler-config
🛠️ Usage
Generate sitemap:
php artisan laracrawler:generate
Options:
- --summary → Print summary of exclusions.
- --debug → Extra debug output.
- --validate → Force validation of links even if disabled in config.
📂 Configuration (config/sitemap.php)
🔗 Base settings
    'base_url'       => env('APP_URL', 'https://example.com'),
    'xdefault'       => 'https://example.com', // <xhtml:link hreflang="x-default">
    'validate_links' => false,
    'max_errors'     => 5000,
🚫 Exclusions
    'exclude_urls' => [
        '/admin',
        '#\?page=\d+#', // regex pagination
        '#/search#',
        '#\.(css|js)$#',
    ],
    'exclude_assets' => [
        '#\.(css|js|json|xml|txt|md)$#',
        '#\.(zip|rar|tar|gz|7z)$#',
    ],
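The exclusion list mixes plain strings and regex patterns. The sketch below shows one way such a list could be evaluated. The helper name isExcluded is hypothetical, and the rule that entries wrapped in # delimiters are regexes while everything else is a substring match is an assumption drawn from the example values, not the package's confirmed behavior.

```php
<?php

/**
 * Hypothetical helper (not part of the package): returns true when $url
 * matches any exclusion entry.
 * Assumption: entries wrapped in '#' delimiters are regexes,
 * all other entries are plain substrings.
 */
function isExcluded(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        $isRegex = strlen($pattern) > 2
            && $pattern[0] === '#'
            && substr($pattern, -1) === '#';

        if ($isRegex) {
            if (preg_match($pattern, $url) === 1) {
                return true;
            }
        } elseif (str_contains($url, $pattern)) {
            return true;
        }
    }

    return false;
}

$excludeUrls = ['/admin', '#\?page=\d+#', '#/search#', '#\.(css|js)$#'];

var_dump(isExcluded('https://example.com/admin/users', $excludeUrls)); // bool(true)
var_dump(isExcluded('https://example.com/blog?page=2', $excludeUrls)); // bool(true)
var_dump(isExcluded('https://example.com/blog/post', $excludeUrls));   // bool(false)
```

Trying a few real URLs against the patterns this way before a full crawl can catch rules that are broader than intended.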
🌍 Normalization
    'normalize' => [
        'strip_queries'        => true,
        'strip_anchors'        => true,
        'strip_trailing_slash' => true,
        'canonicalize'         => true, // lowercase
        'enforce_https'        => true,
        'enforce_www'          => null, // true = add, false = strip
        'force_trailing_slash' => false,
    ],
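To make the effect of these options concrete, here is a minimal sketch of a normalizer driven by the same keys. The function normalizeUrl is hypothetical and only approximates what the options suggest. In particular, letting force_trailing_slash take precedence over strip_trailing_slash is an assumption.

```php
<?php

/**
 * Hypothetical normalizer (not the package's code) driven by the
 * 'normalize' option keys shown above.
 */
function normalizeUrl(string $url, array $opts): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return $url; // leave anything unparsable untouched
    }

    $scheme = ($opts['enforce_https'] ?? false) ? 'https' : ($parts['scheme'] ?? 'https');
    $host   = $parts['host'];
    $path   = $parts['path'] ?? '/';

    if ($opts['canonicalize'] ?? false) {
        $host = strtolower($host);
        $path = strtolower($path);
    }

    // null leaves the host as-is; true adds "www.", false strips it.
    $www = $opts['enforce_www'] ?? null;
    if ($www === true && !str_starts_with($host, 'www.')) {
        $host = 'www.' . $host;
    } elseif ($www === false && str_starts_with($host, 'www.')) {
        $host = substr($host, 4);
    }

    // Assumption: forcing a trailing slash wins over stripping it.
    if ($opts['force_trailing_slash'] ?? false) {
        if (!str_ends_with($path, '/')) {
            $path .= '/';
        }
    } elseif (($opts['strip_trailing_slash'] ?? false) && $path !== '/') {
        $path = rtrim($path, '/');
        if ($path === '') {
            $path = '/';
        }
    }

    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $result = $scheme . '://' . $host . $port . $path;

    if (!($opts['strip_queries'] ?? false) && isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (!($opts['strip_anchors'] ?? false) && isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }

    return $result;
}

$normalize = [
    'strip_queries'        => true,
    'strip_anchors'        => true,
    'strip_trailing_slash' => true,
    'canonicalize'         => true,
    'enforce_https'        => true,
    'enforce_www'          => null,
    'force_trailing_slash' => false,
];

echo normalizeUrl('http://WWW.Example.com/Blog/Post/?page=2#top', $normalize), "\n";
// https://www.example.com/blog/post
```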
🌐 Multilingual
    'default_lang' => 'en',
    'lang_mode'    => 'path', // "path", "subdomain", or "query"
    'alternates'   => [
        'en' => 'https://example.com/en',
        'ar' => 'https://example.com/ar',
        'tr' => 'https://example.com/tr',
    ],
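As an illustration of what the alternates map produces in path mode, the sketch below builds the hreflang links for one page. Both helper names are hypothetical, and appending the page path to the xdefault base URL is an assumption about how x-default is resolved.

```php
<?php

/**
 * Hypothetical helper for lang_mode = "path": maps a language-neutral
 * path onto each configured alternate base URL.
 */
function buildAlternates(string $path, array $alternates, string $xDefault): array
{
    $path  = '/' . ltrim($path, '/');
    $links = [];

    foreach ($alternates as $lang => $base) {
        $links[$lang] = rtrim($base, '/') . $path;
    }

    // Assumption: x-default points at the same path on the xdefault base.
    $links['x-default'] = rtrim($xDefault, '/') . $path;

    return $links;
}

/** Renders the links as <xhtml:link> elements for a sitemap <url> entry. */
function renderAlternates(array $links): string
{
    $xml = '';
    foreach ($links as $lang => $href) {
        $xml .= sprintf(
            '<xhtml:link rel="alternate" hreflang="%s" href="%s"/>' . "\n",
            htmlspecialchars((string) $lang, ENT_XML1 | ENT_QUOTES),
            htmlspecialchars($href, ENT_XML1 | ENT_QUOTES)
        );
    }

    return $xml;
}

$alternates = [
    'en' => 'https://example.com/en',
    'ar' => 'https://example.com/ar',
    'tr' => 'https://example.com/tr',
];

$links = buildAlternates('/services/web-design', $alternates, 'https://example.com');
echo renderAlternates($links);
```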
🖼 Include Rules
    'include' => [
        'urls'      => true,
        'images'    => true,
        'videos'    => true,
        'languages' => true,
        'rules'     => [
            '#/blog#' => [
                'images' => true,
                'videos' => false,
            ],
        ],
    ],
🖼 Image Settings
    'image_whitelist' => [
        // '/storage/uploads/services/',
    ],
    'image_defaults' => [
        'title'       => 'Image Title',
        'description' => 'Image Description',
    ],
🎥 Video Settings
    'video_whitelist' => [
        // '/storage/uploads/services/',
    ],
    'video_defaults' => [
        'title'       => 'Video Title',
        'description' => 'Video Description',
    ],
📊 Rules (SEO Overrides)
Rules let you override defaults per URL pattern:
    'rules' => [
        '/$' => [ // homepage
            'changefreq' => 'daily',
            'priority'   => '1.0',
            'lastmod'    => 'now',
        ],
        '/blog' => [
            'changefreq'     => 'daily',
            'priority'       => '0.9',
            'priority_boost' => 0.3, // 🚀 boost blogs slightly
            'lastmod'        => [
                'strategy' => 'db',
                'table'    => 'posts',
                'lookup'   => 'slug',
                'column'   => 'updated_at',
            ],
        ],
        // The language prefix is optional, so both /service and /en/service match.
        '#^(/(en|ar|tr))?/service#' => [
            'changefreq'     => 'weekly',
            'priority'       => null, // auto-score
            'priority_boost' => 0.3,  // 🚀 boost services
            'lastmod'        => [
                'strategy' => 'db',
                'table'    => 'services',
                'lookup'   => 'slug',
                'column'   => 'updated_at',
            ],
        ],
    ],
- priority → fixed value (0.1–1.0) or null for auto-score.
- priority_boost → bump score (applied only if auto-score).
- lastmod strategies:
  - "now" → always current timestamp
  - "file" → filesystem mtime
  - "db" → fetch updated_at from DB
  - "callback" → custom closure or service
📈 Priority Scoring
    'priority_scoring' => [
        'enabled' => true,
        'weights' => [
            'depth'     => 0.4,
            'links'     => 0.4,
            'freshness' => 0.2,
        ],
        'min' => 0.1,
        'max' => 1.0,
    ],
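The README does not spell out the scoring formula, so the following is only a sketch of how a weighted score could be combined from the three signals and clamped to the min/max range. The function scorePriority, the way each signal is scaled to 0..1, and the 30-day freshness decay are all assumptions made for illustration.

```php
<?php

/**
 * Hypothetical weighted score (not the package's formula).
 * Each signal is scaled to 0..1, weighted, summed, boosted, then clamped.
 */
function scorePriority(
    int $depth,
    int $inboundLinks,
    int $maxInboundLinks,
    int $ageDays,
    array $cfg,
    float $boost = 0.0
): float {
    // Shallower pages score higher: depth 0 => 1.0, depth 1 => 0.5, ...
    $depthScore = 1 / (1 + max(0, $depth));

    // Share of the most-linked page's inbound internal links.
    $linkScore = $maxInboundLinks > 0
        ? min(1.0, $inboundLinks / $maxInboundLinks)
        : 0.0;

    // Assumed decay: a page updated 30 days ago scores 0.5.
    $freshScore = 1 / (1 + max(0, $ageDays) / 30);

    $w   = $cfg['weights'];
    $raw = $w['depth'] * $depthScore
         + $w['links'] * $linkScore
         + $w['freshness'] * $freshScore
         + $boost;

    return round(max($cfg['min'], min($cfg['max'], $raw)), 1);
}

$scoring = [
    'weights' => ['depth' => 0.4, 'links' => 0.4, 'freshness' => 0.2],
    'min'     => 0.1,
    'max'     => 1.0,
];

echo scorePriority(0, 10, 10, 0, $scoring), "\n";       // homepage-like page
echo scorePriority(3, 0, 10, 330, $scoring), "\n";      // deep, stale, unlinked page
echo scorePriority(3, 0, 10, 330, $scoring, 0.3), "\n"; // same page with priority_boost
```

With the default weights, depth and link popularity contribute equally and freshness counts for half as much as either.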
📡 Pinging Search Engines
    'ping' => true,
    'ping_targets' => [
        'Google' => 'http://www.google.com/ping?sitemap=',
        'Bing'   => 'http://www.bing.com/ping?sitemap=',
        'Yandex' => 'https://webmaster.yandex.com/ping?sitemap=',
        'Baidu'  => 'http://ping.baidu.com/ping?sitemap=',
    ],
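Each target ends in sitemap=, which suggests the sitemap URL is appended as a query value. A minimal sketch of building the final ping URLs follows. The helper buildPingUrls is hypothetical, and URL-encoding the sitemap address is an assumption about how the package assembles the request.

```php
<?php

/**
 * Hypothetical helper: appends the URL-encoded sitemap address to each
 * configured ping endpoint.
 */
function buildPingUrls(string $sitemapUrl, array $targets): array
{
    $urls = [];
    foreach ($targets as $name => $endpoint) {
        $urls[$name] = $endpoint . rawurlencode($sitemapUrl);
    }

    return $urls;
}

$pingTargets = [
    'Google' => 'http://www.google.com/ping?sitemap=',
    'Bing'   => 'http://www.bing.com/ping?sitemap=',
];

$pingUrls = buildPingUrls('https://example.com/sitemap.xml', $pingTargets);
print_r($pingUrls);
```

Ping endpoints are retired from time to time (Google announced the deprecation of its sitemap ping endpoint in 2023), so it is worth confirming that each target still responds before relying on it.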
🧵 Queue Support
    'queue' => [
        'enabled'    => false,
        'connection' => 'default',
        'batch_size' => 100,
    ],
🌐 HTTP Client Settings
    'http' => [
        'validate_links' => [
            'timeout'         => 10,
            'connect_timeout' => 5,
            'verify'          => false,
            'http_errors'     => false,
            'headers'         => [
                'User-Agent' => 'LaracrawlerBot/1.0 (https://example.com)',
            ],
        ],
        'validate_alternates' => [
            'timeout'         => 5,
            'connect_timeout' => 1,
            'verify'          => false,
            'http_errors'     => false,
            'headers'         => [
                'User-Agent' => 'LaracrawlerBot/1.0 (https://example.com)',
            ],
        ],
    ],
🕵 Indexability Audit
'indexability_audit' => true,
Flags URLs with:
- X-Robots-Tag: noindex
- <meta name="robots" content="noindex">
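The two signals can be checked with a few lines of plain PHP. The sketch below is illustrative only: hasNoindex is a hypothetical function, it uses the built-in DOMDocument rather than whatever the package uses internally, and it assumes headers arrive as a name => value (or name => list of values) array.

```php
<?php

/**
 * Hypothetical check (not the package's code): true when a response is
 * marked noindex via the X-Robots-Tag header or a robots meta tag.
 */
function hasNoindex(array $headers, string $html): bool
{
    foreach ($headers as $name => $value) {
        if (strcasecmp((string) $name, 'X-Robots-Tag') !== 0) {
            continue;
        }
        foreach ((array) $value as $v) {
            if (stripos((string) $v, 'noindex') !== false) {
                return true;
            }
        }
    }

    if (trim($html) === '') {
        return false;
    }

    $dom      = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if (strcasecmp($meta->getAttribute('name'), 'robots') === 0
            && stripos($meta->getAttribute('content'), 'noindex') !== false) {
            return true;
        }
    }

    return false;
}

var_dump(hasNoindex(['X-Robots-Tag' => 'noindex, nofollow'], ''));
var_dump(hasNoindex([], '<html><head><meta name="robots" content="noindex"></head><body></body></html>'));
var_dump(hasNoindex([], '<html><head><meta name="robots" content="index,follow"></head><body></body></html>'));
```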
🛠 Artisan Command
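All available flags are shown together below for reference. Pairs such as --split/--single and --no-ping/--ping-only are alternatives and are not meant to be combined in one run.

    php artisan laracrawler:generate \
        --max-depth=2 \
        --output=public \
        --split \
        --single \
        --no-ping \
        --ping-only \
        --sitemap=sitemap.xml \
        --debug \
        --summary \
        --fresh \
        --queue \
        --validate \
        --audit-indexability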
Flags
- --max-depth → set crawl depth
- --output → custom output dir
- --split → force multiple sitemap files
- --single → force one sitemap.xml
- --no-ping → skip pinging search engines
- --ping-only → only ping, no crawl
- --sitemap → custom sitemap name (with ping-only)
- --debug → show exclusions in detail
- --summary → summary of exclusions
- --fresh → clear cache and recrawl
- --queue → run crawl in background via jobs
- --validate → enable link validation
- --audit-indexability → enable noindex audit
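Because generation is an Artisan command, it can be run on a schedule like any other. The fragment below is a sketch for Laravel 11 or later, where scheduled tasks are typically declared in routes/console.php. On older versions the same call goes in the schedule method of app/Console/Kernel.php. The chosen flags and the 02:00 run time are illustrative assumptions.

```php
<?php

// routes/console.php (Laravel 11+)

use Illuminate\Support\Facades\Schedule;

// Regenerate the sitemap nightly with a fresh crawl and link validation.
// withoutOverlapping() skips a run if the previous crawl is still going.
Schedule::command('laracrawler:generate --fresh --validate')
    ->dailyAt('02:00')
    ->withoutOverlapping();
```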
📦 Outputs
- sitemap.xml or sitemap-index.xml
- sitemap-errors.xml (broken links, invalid alternates, noindex pages)
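To show how a large URL set turns into an index plus several files, here is a minimal sketch of the splitting step. It only applies the 50,000-URL limit (the 50MB size limit is ignored), and the function name and the sitemap-N.xml file names are hypothetical rather than the package's actual naming.

```php
<?php

/**
 * Hypothetical splitter: chunks URLs into files of at most $maxPerFile
 * entries and builds a sitemap index that references each file.
 */
function buildSitemapIndex(array $urls, string $baseUrl, int $maxPerFile = 50000): array
{
    $files = [];
    foreach (array_chunk($urls, $maxPerFile) as $i => $chunk) {
        $files[sprintf('sitemap-%d.xml', $i + 1)] = $chunk; // assumed naming
    }

    $index  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $index .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach (array_keys($files) as $name) {
        $loc    = rtrim($baseUrl, '/') . '/' . $name;
        $index .= '  <sitemap><loc>' . htmlspecialchars($loc, ENT_XML1) . '</loc></sitemap>' . "\n";
    }
    $index .= '</sitemapindex>' . "\n";

    return ['files' => $files, 'index' => $index];
}

// Small limit used here only to keep the demonstration short.
$demoUrls = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/c',
    'https://example.com/d',
    'https://example.com/e',
];

$split = buildSitemapIndex($demoUrls, 'https://example.com', 2);
echo $split['index'];
```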
✅ SEO Benefits
- Clean, canonicalized URLs only
- Correct handling of alternates (hreflang + x-default)
- Image metadata (title, caption)
- Video metadata (title, description)
- Excludes noindex & broken pages automatically
- Auto-prioritization for deep/fresh/popular content
🔧 Best Practices
- Always run with --validate in production
- Configure ping_targets so Google/Bing are notified of sitemap updates sooner
- Use priority_boost in rules for critical pages
- Whitelist only important image/video directories to keep the sitemap lean
- Enable indexability_audit so that noindex pages are kept out of the sitemap
📜 License
This package is open-sourced software licensed under the MIT license.
