dmoraschi / sitemap-common
Sitemap generator and crawler library
Requires
- php: >=5.6
- guzzlehttp/guzzle: ~6.0
Requires (Dev)
- mockery/mockery: @stable
- phpunit/phpunit: 4.*@stable
- satooshi/php-coveralls: dev-master
README
This package provides all of the components to crawl a website and build and write sitemap files.
An example console application using the library: dmoraschi/sitemap-app
Installation
Run the following command and provide the latest stable version (e.g. v1.0.0):
composer require dmoraschi/sitemap-common
or add the following to your composer.json file:
"dmoraschi/sitemap-common": "1.0.*"
SiteMapGenerator
Basic usage
$generator = new SiteMapGenerator(
    new FileWriter($outputFileName),
    new XmlTemplate()
);
Add a URL:
$generator->addUrl($url, $frequency, $priority);
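For example, with concrete values (the frequency and priority used here follow the standard sitemap protocol; whether this library expects them in exactly this form is an assumption):
$generator->addUrl('https://www.example.com/about', 'weekly', 0.8);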
Add a single SiteMapUrl object or an array of SiteMapUrl objects:
$siteMapUrl = new SiteMapUrl(
    new Url($url), $frequency, $priority
);

$generator->addSiteMapUrl($siteMapUrl);
$generator->addSiteMapUrls([
    $siteMapUrl,
    $siteMapUrl2
]);
Set the URLs of the sitemap via a SiteMapUrlCollection:
$siteMapUrl = new SiteMapUrl(
    new Url($url), $frequency, $priority
);

$collection = new SiteMapUrlCollection([
    $siteMapUrl,
    $siteMapUrl2
]);

$generator->setCollection($collection);
Generate the sitemap:
$generator->execute();
Crawler
Basic usage
$crawler = new Crawler(
    new Url($baseUrl),
    new RegexBasedLinkParser(),
    new HttpClient()
);
You can tell the Crawler not to visit certain URLs by adding policies. Below are the default policies provided by the library:
$crawler->setPolicies([
    'host' => new SameHostPolicy($baseUrl),
    'url'  => new UniqueUrlPolicy(),
    'ext'  => new ValidExtensionPolicy(),
]);

// or
$crawler->setPolicy('host', new SameHostPolicy($baseUrl));
SameHostPolicy, UniqueUrlPolicy and ValidExtensionPolicy are provided with the library; you can define your own policies by implementing the Policy interface, as sketched below.
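As an illustration, a custom policy could restrict crawling to a path prefix. This is only a sketch: the shouldCrawl method name and the string cast of Url are assumptions, not documented API, so adapt them to the actual Policy interface.
// Hypothetical policy: only crawl URLs whose path starts with a prefix.
// NOTE: shouldCrawl() and the (string) cast of Url are assumptions about
// the Policy / Url interfaces; check the library source before using.
class PathPrefixPolicy implements Policy
{
    private $prefix;

    public function __construct($prefix)
    {
        $this->prefix = $prefix;
    }

    public function shouldCrawl(Url $url)
    {
        $path = parse_url((string) $url, PHP_URL_PATH);

        return is_string($path) && strpos($path, $this->prefix) === 0;
    }
}

$crawler->setPolicy('path', new PathPrefixPolicy('/blog'));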
Calling the crawl function, the object will start from the base URL given in the constructor and crawl all the web pages up to the depth passed as an argument. The function returns an array of all the unique visited Url objects:
$urls = $crawler->crawl($depth);
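Putting the two components together, the Url objects returned by the crawler can be fed straight into the generator. The 'weekly' frequency and 0.5 priority below are only placeholder values:
foreach ($crawler->crawl($depth) as $url) {
    // Each $url is a Url object returned by the crawler.
    $generator->addSiteMapUrl(new SiteMapUrl($url, 'weekly', 0.5));
}

$generator->execute();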
You can also instruct the Crawler to collect custom data while visiting the web pages by adding Collector objects to the main object:
$crawler->setCollectors([
    'images' => new ImageCollector()
]);

// or
$crawler->setCollector('images', new ImageCollector());
And then retrieve the collected data:
$crawler->crawl($depth);

$imageCollector = $crawler->getCollector('images');
$data = $imageCollector->getCollectedData();
ImageCollector is provided by the library; you can define your own collector by implementing the Collector interface, as sketched below.
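For example, a collector that records page titles might look like the following. This is only a sketch: getCollectedData() matches the accessor shown above, but the collect() method name and its (Url, HTML body) signature are assumptions about the Collector interface.
// Hypothetical collector that stores the <title> of every visited page.
// NOTE: collect() and its signature are assumptions; only
// getCollectedData() is shown by the library's own example above.
class TitleCollector implements Collector
{
    private $titles = [];

    public function collect(Url $url, $html)
    {
        if (preg_match('/<title>(.*?)<\/title>/is', $html, $matches)) {
            $this->titles[(string) $url] = trim($matches[1]);
        }
    }

    public function getCollectedData()
    {
        return $this->titles;
    }
}

$crawler->setCollector('titles', new TitleCollector());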