contextualcode/content-import

Content import bundle

Type: symfony-bundle

v2.9.2 2024-04-25 17:05 UTC

README

This package imports content from the results of contextualcode/crawler. Although it was originally designed for eZ Platform v3, it contains no CMS/CMR/DXP-specific functionality. Instead, it provides an abstraction layer, so it can easily be implemented for any new CMS/CMR/DXP.

Requirements

The only requirement for a new platform on which this package is to be implemented concerns the platform's content model. We assume that all modern CMS/CMR/DXP platforms have a content model similar to that of eZ Platform:

  • Each content item has a content type. For example, an article is an instance of the "Article" content type, and a product is an instance of the "Product" content type.
  • The content is versioned, so each content item can have multiple versions. The version identifier is an integer.
  • The content item is used only to store the data (content fields).
  • Content items are organized into a structure by separate "Location" items. The type of structure is not important: it might be a tree, a catalog, etc.

Basic Concepts

This package introduces a few concepts related to content import, and it is important to understand each of them. More detailed information is available in the reference.

Page

Each website URL is represented by a ContextualCode\Crawler\Entity\Page entity. Even binary files and images have their own Page entities.

These entities are created by contextualcode/crawler and are consumed by this package. Please check the crawler's documentation to learn how to crawl a website and store the results in Page entities.

Content Import Handler

Each Page is transformed into a single content item, and special handlers perform that transformation. Each handler is responsible for exactly one content type (in CMS terms). For example, the "Article" Content Import Handler converts all the articles, and the "Blog Post" Content Import Handler converts all the blog posts. The single responsibility of a Content Import Handler is to convert a Page entity into a CMS content item.

More details are available in the reference.

Content Field Transformer

A Content Import Handler defines exactly how a Page is converted into a content item. This includes specifying how the value of each content field is extracted from the page. Content field values are extracted by Content Field Transformers: they receive the Page entity and some options as input and return the content field value. This package provides a few Content Field Transformers. A good example of their usage is an Article page. To convert it into a content item, the following data needs to be extracted:

  • Title: the text-line Content Field Transformer extracts a line of text from the Page entity using the specified XPath selector.
  • Body: the html Content Field Transformer extracts HTML content from the Page entity using the provided XPath selector.
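
To make the Article example more concrete, the following is a minimal sketch of what XPath-based extraction of a text line and of an HTML fragment could look like. It relies only on PHP's built-in DOM extension. The function names (loadPageXPath, extractTextLine, extractHtml) are hypothetical and are not the package's actual transformer API; the real transformers receive a Page entity and options, while this sketch works on a raw HTML string.

```php
<?php
declare(strict_types=1);

// Hypothetical helpers for illustration only; they are NOT part of
// contextualcode/content-import.

/** Parse raw page HTML and return an XPath query object for it. */
function loadPageXPath(string $html): DOMXPath
{
    $document = new DOMDocument();
    // Real-world markup is often invalid; collect libxml errors silently.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($document);
}

/** "text-line" style: text of the first matching node, as one trimmed line. */
function extractTextLine(string $html, string $xpath): ?string
{
    $nodes = loadPageXPath($html)->query($xpath);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Collapse whitespace runs so the value fits on a single line.
    return trim((string) preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

/** "html" style: inner HTML of the first matching node. */
function extractHtml(string $html, string $xpath): ?string
{
    $nodes = loadPageXPath($html)->query($xpath);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $node = $nodes->item(0);
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $node->ownerDocument->saveHTML($child);
    }

    return trim($inner);
}

// Usage with a tiny stand-in for a crawled Article page.
$page = '<html><body><h1> Hello   world </h1>'
    . '<div class="body"><p>First paragraph.</p></div></body></html>';

echo extractTextLine($page, '//h1'), PHP_EOL;
echo extractHtml($page, '//div[@class="body"]'), PHP_EOL;
```

Returning null when the selector matches nothing is a design choice of this sketch; it lets a caller distinguish a missing field from an empty one.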

More details are available in the reference.

Content Hash Transformer

All the extracted content fields are hashed, which makes it possible to determine whether the import sources have changed. Content Hash Transformers work similarly to Content Field Transformers, but they receive the field value as input and return its string representation.
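
As an illustration of the idea, here is a small sketch of turning field values of different types into stable string representations. The function name fieldValueToHashString and the chosen formats (ISO 8601 dates, JSON with sorted keys) are assumptions made for this example, not the package's actual Content Hash Transformers.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch; NOT the package's real Content Hash Transformers.

/** Convert a field value into a deterministic string representation. */
function fieldValueToHashString(mixed $value): string
{
    if ($value === null) {
        return '';
    }
    if (is_bool($value)) {
        return $value ? '1' : '0';
    }
    if (is_scalar($value)) {
        return (string) $value;
    }
    if ($value instanceof DateTimeInterface) {
        return $value->format(DATE_ATOM);
    }
    if (is_array($value)) {
        // Sort keys so that key order never changes the representation.
        ksort($value);

        return json_encode(
            array_map('fieldValueToHashString', $value),
            JSON_THROW_ON_ERROR
        );
    }

    throw new InvalidArgumentException('Unsupported field value type.');
}

// Usage: the same data in a different key order gives the same string.
$first = fieldValueToHashString(['alt' => 'Logo', 'file' => 'logo.png']);
$second = fieldValueToHashString(['file' => 'logo.png', 'alt' => 'Logo']);
var_dump($first === $second);
```

The important property is determinism: equal field values must always map to the same string, otherwise unchanged content would appear to have changed.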

More details are available in the reference.

Content Hash

Each content item created or updated by the content import script has a Content Hash, which is the composed hash of all its content fields. It is used to determine whether a content item requires an update on subsequent executions of the content import script. It is also used to track whether the content has been edited manually since it was imported. In that case, the content will not be updated during subsequent content import script executions, so the manual changes are preserved.
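
The decision described above can be sketched roughly as follows. This is one possible reading, under stated assumptions: the hash stored at the last import is compared both with a hash of the current source and with a hash of what is currently in the CMS. The function names, the SHA-256 algorithm, and the returned action labels are all hypothetical and are not taken from the package.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch of the Content Hash logic; names and the hashing
// algorithm are assumptions, not the package's implementation.

/**
 * Compose one hash from the string representations of all fields.
 *
 * @param array<string, string> $fieldStrings field identifier => string value
 */
function composeContentHash(array $fieldStrings): string
{
    ksort($fieldStrings); // field order must not influence the hash

    return hash('sha256', json_encode($fieldStrings, JSON_THROW_ON_ERROR));
}

/**
 * Decide what the import should do with one content item.
 *
 * @param string|null $storedHash hash saved at the previous import (null if never imported)
 * @param string      $sourceHash hash of the fields extracted from the source now
 * @param string|null $cmsHash    hash of the fields currently stored in the CMS
 */
function decideImportAction(?string $storedHash, string $sourceHash, ?string $cmsHash): string
{
    if ($storedHash === null) {
        return 'create';
    }
    if ($cmsHash !== null && $cmsHash !== $storedHash) {
        return 'skip-manually-edited'; // keep the editor's changes
    }

    return $sourceHash === $storedHash ? 'skip-unchanged' : 'update';
}

// Usage
$imported = composeContentHash(['title' => 'Hello', 'body' => '<p>Old</p>']);
$source = composeContentHash(['title' => 'Hello', 'body' => '<p>New</p>']);
$edited = composeContentHash(['title' => 'Hello!', 'body' => '<p>Old</p>']);

echo decideImportAction(null, $source, null), PHP_EOL;
echo decideImportAction($imported, $source, $imported), PHP_EOL;
echo decideImportAction($imported, $imported, $imported), PHP_EOL;
echo decideImportAction($imported, $source, $edited), PHP_EOL;
```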

Location Hash

The Location Hash is very similar to the Content Hash. However, it is not calculated from the content fields: only the URL of the source Page is used to calculate it, because the URL is the only parameter that defines the position (location) of the content item in the CMS content structure.
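
A minimal sketch of this idea, assuming SHA-256 over the raw URL string (the function name computeLocationHash and the algorithm are hypothetical choices for this example):

```php
<?php
declare(strict_types=1);

// Hypothetical sketch: the Location Hash depends on the source URL only.
function computeLocationHash(string $sourceUrl): string
{
    return hash('sha256', $sourceUrl);
}

// Usage: a moved page yields a different hash, whatever its content is.
$before = computeLocationHash('https://example.com/news/hello-world');
$after = computeLocationHash('https://example.com/archive/hello-world');
var_dump($before === $after);
```

Because field values are not part of the input, editing the page body leaves this hash unchanged, while moving the page to another URL changes it.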

Content Operations

Content Operations are the CMS-specific operations such as creating content, updating content, adding a location, etc. This package provides only an interface and an example dummy implementation for Content Operations. They need to be implemented in a CMS/CMR/DXP-specific package.
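
To give a feel for the shape of such an abstraction, here is a self-contained sketch with an in-memory dummy implementation. The interface name SketchContentOperations and its three methods are invented for this example; the method list of the package's real ContentOperationsInterface is not shown here and should be taken from the reference.

```php
<?php
declare(strict_types=1);

// Hypothetical interface for illustration; NOT the package's
// ContentOperationsInterface.
interface SketchContentOperations
{
    /** Create a content item and return its identifier. */
    public function createContent(string $contentType, array $fields): int;

    /** Store a new version of the item and return the new version number. */
    public function updateContent(int $contentId, array $fields): int;

    /** Place the item under a parent path and return the resulting path. */
    public function addLocation(int $contentId, string $parentPath): string;
}

/** Dummy implementation that keeps everything in memory. */
final class InMemoryContentOperations implements SketchContentOperations
{
    /** @var array<int, array<int, array{type: string, fields: array}>> */
    private array $versions = [];
    /** @var array<string, int> location path => content identifier */
    private array $locations = [];
    private int $nextId = 1;

    public function createContent(string $contentType, array $fields): int
    {
        $id = $this->nextId++;
        // Versions are identified by integers, starting at 1.
        $this->versions[$id] = [1 => ['type' => $contentType, 'fields' => $fields]];

        return $id;
    }

    public function updateContent(int $contentId, array $fields): int
    {
        if (!isset($this->versions[$contentId])) {
            throw new OutOfBoundsException("Unknown content item $contentId.");
        }
        $current = max(array_keys($this->versions[$contentId]));
        $type = $this->versions[$contentId][$current]['type'];
        $this->versions[$contentId][$current + 1] = ['type' => $type, 'fields' => $fields];

        return $current + 1;
    }

    public function addLocation(int $contentId, string $parentPath): string
    {
        $path = rtrim($parentPath, '/') . '/' . $contentId;
        $this->locations[$path] = $contentId;

        return $path;
    }
}

// Usage
$operations = new InMemoryContentOperations();
$articleId = $operations->createContent('article', ['title' => 'Hello']);
$newVersion = $operations->updateContent($articleId, ['title' => 'Hello again']);
$location = $operations->addLocation($articleId, '/news/');
echo "$articleId $newVersion $location", PHP_EOL;
```

Keeping the import logic behind an interface like this is what allows the same handlers and transformers to be reused across platforms; only the operations class changes.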

More details are available in the reference.

Installation

  1. Require contextualcode/content-import via Composer:

     composer require contextualcode/content-import
    
  2. Run the migrations:

     php bin/console doctrine:migrations:migrate --configuration=vendor/contextualcode/crawler/src/Resources/config/doctrine_migrations.yaml --no-interaction
     php bin/console doctrine:migrations:migrate --configuration=vendor/contextualcode/content-import/src/Resources/config/doctrine_migrations.yaml --no-interaction
    

Usage

This package ships with an example dummy CMS integration. To integrate it with a new CMS/CMR/DXP, follow these steps:

  1. Create a CMS/CMR/DXP-specific package that will map the content model:

  2. Define the CMS/CMR/DXP-specific content operations handler. It should implement ContentOperationsInterface; for an example, see Service/Integration/ContentOperations.

  3. Register the implemented content operations handler as the ContextualCode\ContentImport\ContentHandler\ContentOperationsInterface service:

     ContextualCode\ContentImport\ContentHandler\ContentOperationsInterface:
         class: ContextualCode\ContentImport\Service\Integration\ContentOperations