pforret/pf-article-extractor

PhpArticleExtractor. Boilerplate Removal and Fulltext Extraction from HTML pages

0.3.0 2024-06-03 22:57 UTC


Last update: 2024-10-13 08:59:41 UTC


README


Boilerplate Removal and Fulltext Extraction from HTML pages.

Rewrite of dotpack/php-boiler-pipe for PHP 8.2 and up, with tests.

Installation

composer require pforret/pf-article-extractor

Usage

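require __DIR__ . '/vendor/autoload.php'; // Composer autoloader

use Pforret\PfArticleExtractor\ArticleExtractor;

// $html must hold the full HTML source of a page ('page.html' is a placeholder file name)
$html = file_get_contents('page.html');

$articleData = ArticleExtractor::getArticle($html);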
/*
 * $articleData = Pforret\PfArticleExtractor\Formats\ArticleContents Object
(
    [title] => Film Podcast: Wicked Little Letters Named Film of the Month
    [content] => UK Film Club was back in March with a new episode of their film podcast. (...)
    [date] =>
    [images] => Array
        (
            [0] => https://static.wixstatic.com/media/.../b19cd0_dde0d59546f84127865267f43994f39b~mv2.jpg
        )

    [links] => Array
        (
            [0] => https://www.chrisolson.co.uk/
            (...)
        )

)

 */

Under the hood

  • The package accepts a full HTML page as input.
  • It walks the DOM tree to locate the main article content.
  • It removes boilerplate content (headers, footers, sidebars, ...).
  • It extracts the text of the main article.
  • It tries to extract the title, date, images and links from the article.
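
To make the steps above concrete, here is a minimal sketch of a similar pipeline built on PHP's bundled DOM extension. It is not the package's actual algorithm: the function name `sketchExtractArticle`, the list of boilerplate tags and the "container with the most paragraph text" heuristic are all assumptions made for illustration, and date extraction is left out.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative sketch only, NOT the package's implementation.
 * The function name and the heuristic are hypothetical.
 */
function sketchExtractArticle(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid: collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Title: use the <title> element when present.
    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    // Step 1: remove elements that usually hold boilerplate (assumed tag list).
    $boilerplate = $xpath->query('//header | //footer | //nav | //aside | //script | //style');
    foreach (iterator_to_array($boilerplate) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Step 2: pick the element whose direct <p> children hold the most text.
    $best = null;
    $bestLength = 0;
    foreach ($xpath->query('//*[p]') as $candidate) {
        $length = 0;
        foreach ($xpath->query('p', $candidate) as $p) {
            $length += strlen(trim($p->textContent));
        }
        if ($length > $bestLength) {
            $best = $candidate;
            $bestLength = $length;
        }
    }

    // Step 3: collect text, images and links from the chosen container.
    $paragraphs = [];
    $images = [];
    $links = [];
    if ($best !== null) {
        foreach ($xpath->query('p', $best) as $p) {
            $paragraphs[] = trim($p->textContent);
        }
        foreach ($xpath->query('.//img/@src', $best) as $src) {
            $images[] = $src->nodeValue;
        }
        foreach ($xpath->query('.//a/@href', $best) as $href) {
            $links[] = $href->nodeValue;
        }
    }

    return [
        'title'   => $title,
        'content' => implode("\n", $paragraphs),
        'images'  => $images,
        'links'   => $links,
    ];
}

// Usage with a small made-up page.
$html = '<html><head><title>Demo post</title></head><body>'
    . '<header><p>Site name and a long tagline that is not part of the article</p></header>'
    . '<nav><a href="/home">Home</a></nav>'
    . '<div id="main"><p>First paragraph of the article.</p>'
    . '<p>Second paragraph with a <a href="https://example.com/">link</a>.</p>'
    . '<img src="/img/photo.jpg"></div>'
    . '<footer><p>Copyright</p></footer>'
    . '</body></html>';

$result = sketchExtractArticle($html);
print_r($result);
```

A single-pass heuristic like this breaks easily on real sites, which is why testing against example pages from several platforms matters.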

Right now it is tested with example pages for:

  • Blogger
  • Drupal
  • Jekyll
  • MkDocs
  • Wix
  • WordPress

Similar packages