teon/brute-force-sitemap-generator

Generate sitemaps by crawling your website for static pages and using hooks for dynamic content

v0.1.0 2015-11-14 15:38 UTC

Generate sitemaps by crawling your website for static pages and using hooks for dynamic content. The intermediate list of sitemap URIs is stored in relative form. When a sitemap is requested, the final document is generated on the fly by prefixing each relative URI with the configured base URI, so the served sitemap contains only full (absolute) URIs.

Features:

  • crawl your website for static content and generate a list of URIs;
  • seed the crawler with an existing URI list, or add URIs manually and recrawl, so that content missed once is not missed again;
  • store URIs in relative form;
  • when serving sitemaps, convert relative URIs to absolute form by prepending the configured base URI.

Target users

A sitemap should be as accurate as possible, which means your application needs infrastructure for enumerating all the content it serves. Many applications do not support this, or support it only partially.

Users who are stuck with such applications and still have to provide sitemaps are usually left with pre-generating them using public web crawlers. The resulting sitemaps tend to be inaccurate and quickly become stale.

This is where Brute Force Sitemap Generator (BFSG) steps in.

Modes of operation

Definition of terms:

  • base URI: the URI under which the sitemap will reside, e.g. https://example.com/ (without the trailing "sitemap.*")
  • transData: short for "transitional data"; sitemap data that does not contain absolute URIs. Absolute URIs are generated at the very last stage: an HTTP request for a sitemap triggers generation of the final sitemap, in which every relative URI is prefixed with the base URI. The base URI itself is obtained dynamically.
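
The last stage described above (turning transData into a final sitemap) can be sketched roughly as follows. This is a minimal illustration, not BFSG's actual code: the function names `toAbsoluteUri` and `renderTextSitemap` and the sample transData are assumptions made for this example, and the base URI is simply passed in as a parameter.

```php
<?php

/**
 * Join a base URI and a relative URI with exactly one slash between them.
 * (Hypothetical helper, not part of BFSG's API.)
 */
function toAbsoluteUri(string $baseUri, string $relativeUri): string
{
    return rtrim($baseUri, '/') . '/' . ltrim($relativeUri, '/');
}

/**
 * Render a plain-text sitemap (one absolute URI per line) from transData.
 *
 * @param string[] $transData relative URIs, e.g. ['/', '/about.html']
 */
function renderTextSitemap(string $baseUri, array $transData): string
{
    $lines = [];
    foreach ($transData as $relativeUri) {
        $lines[] = toAbsoluteUri($baseUri, $relativeUri);
    }
    return implode("\n", $lines) . "\n";
}

// Hypothetical transData, as a crawler might have stored it.
$transData = ['/', '/about.html', 'blog/first-post'];

echo renderTextSitemap('https://example.com/', $transData);
```

Because only relative URIs are stored, the same transData can be served under a different base URI without recrawling.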

BFSG implements the following operations:

  • create transData by crawling an existing website:
      ◦ crawling may be seeded by the base URI only;
      ◦ crawling may be seeded by existing transData (a list of URIs that were encountered previously);
  • augment the transData generated by the crawler by using a callback (for dynamically generated pages);
  • use the transData cache to generate and output the final sitemap.(xml|txt)(.gz)?
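
The three operations above can be sketched as follows. This is an illustration under stated assumptions, not BFSG's implementation: `crawl`, `augment`, `parseSitemapRequest` and the `$fetchLinks` callable are hypothetical names, and the "website" is an in-memory array so the example runs without network access.

```php
<?php

/**
 * Breadth-first crawl over relative URIs.
 *
 * @param string[] $seeds      relative URIs to start from (['/'] or previous transData)
 * @param callable $fetchLinks hypothetical: fn(string $uri): string[] returning
 *                             the relative links found on that page
 * @return string[] every relative URI reached, in discovery order
 */
function crawl(array $seeds, callable $fetchLinks): array
{
    $seen = [];
    $queue = $seeds;
    while ($queue !== []) {
        $uri = array_shift($queue);
        if (isset($seen[$uri])) {
            continue; // already visited
        }
        $seen[$uri] = true;
        foreach ($fetchLinks($uri) as $link) {
            if (!isset($seen[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_map('strval', array_keys($seen));
}

/** Merge callback-provided URIs (dynamic pages) into crawled transData. */
function augment(array $transData, callable $dynamicUris): array
{
    return array_values(array_unique(array_merge($transData, $dynamicUris())));
}

/** Parse a requested file name matching sitemap.(xml|txt)(.gz)? */
function parseSitemapRequest(string $fileName): ?array
{
    if (preg_match('/^sitemap\.(xml|txt)(\.gz)?$/', $fileName, $m) !== 1) {
        return null;
    }
    return ['format' => $m[1], 'gzip' => ($m[2] ?? '') !== ''];
}

// In-memory stand-in for a website: page => links found on it.
$site = [
    '/'             => ['/about.html', '/contact.html'],
    '/about.html'   => ['/'],
    '/contact.html' => [],
    '/orphan.html'  => ['/'], // not linked from anywhere
];
$fetchLinks = fn(string $uri): array => $site[$uri] ?? [];

$first  = crawl(['/'], $fetchLinks);                              // seeded by base URI only
$second = crawl(array_merge($first, ['/orphan.html']), $fetchLinks); // seeded by transData + manual URI
$final  = augment($second, fn(): array => ['/product/1', '/']);

print_r($final);
print_r(parseSitemapRequest('sitemap.xml.gz'));
```

Seeding the second crawl with the earlier result plus a manually added URI is what lets a page with no inbound links (`/orphan.html` here) stay in the sitemap.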

BFSG can be glued to your application in the following ways:

  1. add BFSG to your project as a git submodule:
      ◦ you need to create a sitemap-glue.php file that returns the needed configuration details from your project;
      ◦ sitemap-glue.php must reside at the same path level as the main BFSG directory (just outside the BFSG source tree);
      ◦ the reasoning is that you will want to commit your glue code to your own project repository instead of BFSG's git repo
  2. install BFSG with composer - TODO
  3. Symfony: add BFSG as bundle - TODO
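
For the git submodule option, a glue file might look roughly like the sketch below. The README only states that sitemap-glue.php "returns needed configuration details", so everything here is an assumption made for illustration: the function name `bfsgGlueConfig`, the array keys `seedUris` and `dynamicUris`, and the sample URIs are hypothetical and are not BFSG's documented configuration format.

```php
<?php
// sitemap-glue.php (hypothetical sketch; key names are assumptions,
// not BFSG's documented configuration format)

function bfsgGlueConfig(): array
{
    return [
        // Assumed key: extra relative URIs to seed the crawler with.
        'seedUris' => ['/'],

        // Assumed key: callback listing dynamically generated pages
        // that a crawler may not reach. A real project would query
        // its own database or router here.
        'dynamicUris' => function (): array {
            return ['/product/1', '/product/2'];
        },
    ];
}

// If the glue file is loaded with require/include, its last statement
// would presumably be:
//     return bfsgGlueConfig();
```

Keeping this file outside the BFSG directory means it is tracked by your project's repository, while the submodule itself stays unmodified.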

License

BFSG is released under the MIT license. See the LICENSE file at the root of the repository for additional info.

Credits

Brute Force Sitemap Generator was created and is maintained by Bostjan Skufca and the company Teon d.o.o.