phptek/staticsiteconnector

An external-content connector that retrieves content by scraping a public website.

1.0-rc1 2014-05-14 09:55 UTC

Last update: 2020-09-19 03:25:12 UTC


README

Introduction

This module extracts content from another website by crawling it and parsing its DOM structure. It transforms that content directly into native SilverStripe objects, then imports those objects into SilverStripe's database as though they had been created via the CMS.

The disadvantage of this approach is that the module cannot extract any information or structure that isn't represented in the site's markup. The advantage is that it needs no special access to, or reliance on, particular back-end systems. This makes the module suited to legacy and experimental site imports, as well as to connecting to websites generated by obscure CMSs.

How it works

Importing a site is a two- or three-step process, depending on the options the user selects:

  1. Crawl
  2. Import
  3. Rewrite links (automatic, if selected in step 2)

A list of URLs is extracted from the site via PHPCrawl and cached in a text file under the assets directory.

Each cached URL corresponds to a page or asset (CSS, image, PDF, etc.) that the module will attempt to import as a native SilverStripe object, e.g. SiteTree or File.
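The crawl and cache steps can be pictured with a small, language-neutral sketch. This is not the module's PHP code and does not use PHPCrawl. It is a Python illustration that assumes the URLs are cached one per line and that pages and assets are told apart by file extension. All names here (`collect_urls`, `write_cache`, `read_cache`, `target_type`, `ASSET_EXTENSIONS`) are hypothetical.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: extensions treated as assets; the module's real rules may differ.
ASSET_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".pdf"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs found in href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                url = urljoin(self.base_url, value)
                if url not in self.urls:  # keep first occurrence only
                    self.urls.append(url)


def collect_urls(html, base_url):
    """Return the de-duplicated URLs referenced by one HTML document."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls


def write_cache(urls, path):
    """Cache the crawled URLs in a text file, one URL per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls))


def read_cache(path):
    """Read the cached URLs back, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def target_type(url):
    """Guess which kind of object a URL would be imported as."""
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    return "File" if extension in ASSET_EXTENSIONS else "SiteTree"
```

In this sketch a URL such as `http://example.org/about/` would map to a SiteTree page, while `http://example.org/css/site.css` would map to a File.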

Page content is imported page by page using cURL, and the desired DOM elements are extracted with phpQuery using configurable CSS selectors.
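To make the idea of selector-driven extraction concrete, here is a minimal Python sketch. It is not phpQuery and not the module's implementation. It assumes a mapping from field names to very simple selectors (`tag`, `#id` or `.class` only) and takes the text of the first matching element. The names `SelectorExtractor` and `extract`, and the field names in the usage note, are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SelectorExtractor(HTMLParser):
    """Collects the text of the first element matching a simple selector."""

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.depth = 0      # greater than 0 while inside the matched element
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def _matches(self, tag, attrs):
        attributes = dict(attrs)
        if self.selector.startswith("#"):
            return attributes.get("id") == self.selector[1:]
        if self.selector.startswith("."):
            classes = (attributes.get("class") or "").split()
            return self.selector[1:] in classes
        return tag == self.selector

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html, selectors):
    """Map each field name to the whitespace-normalised text of its selector."""
    result = {}
    for field, selector in selectors.items():
        parser = SelectorExtractor(selector)
        parser.feed(html)
        parser.close()
        result[field] = " ".join("".join(parser.chunks).split())
    return result
```

For example, `extract(html, {"Title": "h1", "Content": "#content"})` would return one text value per field, which an importer could then assign to the matching fields of a new page. Real CSS selectors are far richer than the three forms handled here.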

Migration

See the included migration documentation for detailed instructions on migrating a legacy site into SilverStripe using the module.

Installation

This module requires the PHP Semaphore functions to work. These are installed by default on Debian and some OS X PHP distributions, but if you're using MacPorts you'll need to add the +ipc flag when installing php5.

If compiling PHP from source you need to pass three additional flags to PHP's configure script:

./configure <usual flags> '--enable-sysvsem' '--enable-sysvshm' '--enable-sysvmsg'

Once that's done, you can use Composer to add the module to your SilverStripe project:

#> composer require phptek/staticsiteconnector

Please see the included Migration document, which describes exactly how to configure the tool to perform a site scrape or migration.

An example database dump (MySQL/MariaDB only) is also provided, which you can import into your database to get up and running quickly.

License

This code is available under the BSD license, with the exception of the PHPCrawl library bundled with this module, which is licensed under GPL version 2.

Authors