README

Boilerplate Removal and Fulltext Extraction from HTML pages.

A rewrite of dotpack/php-boiler-pipe for PHP 8.2 and up, with tests.

Installation

composer require pforret/pf-article-extractor

Usage

use Pforret\PfArticleExtractor\ArticleExtractor;

$articleData = ArticleExtractor::getArticle($html);
/*
 * $articleData = Pforret\PfArticleExtractor\Formats\ArticleContents Object
(
    [title] => Film Podcast: Wicked Little Letters Named Film of the Month
    [content] => UK Film Club was back in March with a new episode of their film podcast. (...)
    [date] =>
    [images] => Array
        (
            [0] => https://static.wixstatic.com/media/.../b19cd0_dde0d59546f84127865267f43994f39b~mv2.jpg
        )

    [links] => Array
        (
            [0] => https://www.chrisolson.co.uk/
            (...)
        )

)

 */

Under the hood

The package accepts a full HTML page as input.
It walks the DOM tree and tries to locate the main article content.
It removes boilerplate content (headers, footers, sidebars, ...).
It tries to extract the text of the main article.
It tries to extract the title, date, images and links from the article.
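
The steps above can be sketched with PHP's built-in DOM extension. This is only a minimal illustration of the general idea, not the package's actual algorithm: the function name sketchExtractMainText, the list of tags treated as boilerplate, and the "container with the most paragraph text wins" scoring rule are all assumptions made for this example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch, NOT the package's implementation.
 * Removes typical boilerplate elements, then returns the paragraphs of the
 * container whose direct <p> children hold the most text.
 */
function sketchExtractMainText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Step 1: drop elements that usually carry boilerplate (assumed list).
    $boilerplate = $xpath->query(
        '//script | //style | //nav | //header | //footer | //aside | //form'
    );
    foreach (iterator_to_array($boilerplate) as $node) {
        $node->parentNode?->removeChild($node);
    }

    // Step 2: score each candidate container by the length of its own paragraphs.
    $best = null;
    $bestScore = 0;
    foreach ($xpath->query('//article | //main | //section | //div | //body') as $candidate) {
        $score = 0;
        foreach ($xpath->query('./p', $candidate) as $p) {
            $score += strlen(trim($p->textContent));
        }
        if ($score > $bestScore) {
            $best = $candidate;
            $bestScore = $score;
        }
    }
    if ($best === null) {
        return '';
    }

    // Step 3: return the winning container's paragraphs with whitespace normalised.
    $paragraphs = [];
    foreach ($xpath->query('./p', $best) as $p) {
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent) ?? '');
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }

    return implode("\n\n", $paragraphs);
}
```

A real extractor needs far more than this (character-encoding handling, link-density checks, title and date detection), which is what the package's tested example pages are for.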

Right now it is tested with example pages from:

Blogger
Drupal
Jekyll
MkDocs
Wix
WordPress

Similar packages

beautifulsoup4 - Python, MIT
html-text - Python, MIT
kohlschutter/boilerpipe - Java, Apache 2.0
fivefilters/readability.php - PHP, GPL-3.0
miso-belica/jusText - Python, BSD-2-Clause
codelucas/newspaper - Python, Apache
