An external-content connector that retrieves content by scraping a public website.
This module extracts content from another website by crawling it and parsing its DOM structure. It transforms that content directly into native SilverStripe objects, then imports those objects into SilverStripe's database as though they had been created via the CMS.
Although this leaves the module unable to extract any information or structure that isn't represented in the site's markup, it means that no special access to, or reliance on, particular back-end systems is required. This makes the module suited to legacy and experimental site imports, as well as connections to websites generated by obscure CMSs.
Importing a site is a two- or three-step process, depending on user selection.
- Rewrite Links (Automatic, if selected in step 2.)
A list of URLs is fetched and extracted from the site via PHPCrawl, and cached in a text file under the assets directory.
Each cached URL corresponds to a page or asset (css, image, pdf etc) that the module
will attempt to import into native SilverStripe objects e.g.
Page content is imported page by page using cURL, and the desired DOM elements are extracted with phpQuery using configurable CSS selectors.
See the included migration documentation for detailed instructions on migrating a legacy site into SilverStripe using the module.
This module requires the PHP Semaphore
functions to work. These are installed by default on Debian and some OS X PHP
distributions, but if you're using MacPorts you'll need to add the
If compiling PHP from source, you need to pass three additional flags to PHP's configure script:
./configure <usual flags> '--enable-sysvsem' '--enable-sysvshm' '--enable-sysvmsg'
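Before installing, it can help to confirm that the three System V extensions are actually loaded. The following is a minimal sketch, not part of the module: `missing_sysv_exts` is a hypothetical helper name, and it assumes the extensions appear in the `php -m` module list as `sysvsem`, `sysvshm` and `sysvmsg`, matching the configure flags above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the module).
# Reads a PHP module list on stdin, one name per line as printed by `php -m`,
# and prints each of the three System V extensions that is NOT in that list.
missing_sysv_exts() {
    modules=$(cat)
    for ext in sysvsem sysvshm sysvmsg; do
        # -q: quiet, -i: ignore case, -x: the whole line must match
        printf '%s\n' "$modules" | grep -qix "$ext" || printf '%s\n' "$ext"
    done
}

# Typical use against the PHP binary on your PATH:
#   php -m | missing_sysv_exts
```

If the command prints nothing, all three extensions are present. Any name it prints is an extension that still needs to be enabled for the PHP binary you ran. Note that the command-line PHP and the PHP used by your web server can be different builds.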
Once that's done, you can use Composer to add the module to your SilverStripe project:
#> composer require phptek/staticsiteconnector
Please see the included migration document, which describes exactly how to configure the tool to perform a site scrape or migration.
There is also an example database dump (MySQL/MariaDB only) provided, which you can import into your DB to get up and running quickly.
This code is available under the BSD license, with the exception of the PHPCrawl library bundled with this module, which is licensed under GPL version 2.