pforret / pf-article-extractor
PhpArticleExtractor. Boilerplate Removal and Fulltext Extraction from HTML pages
Requires
- php: ^8.2
- ext-dom: *
- ext-libxml: *
- ext-mbstring: *
- fivefilters/readability.php: ^3.2
Requires (Dev)
- ext-curl: *
- laravel/pint: ^1.16
- phpunit/phpunit: ^11.1
README
Boilerplate Removal and Fulltext Extraction from HTML pages.
A rewrite of dotpack/php-boiler-pipe for PHP 8.2 and up, with tests.
Installation
composer require pforret/pf-article-extractor
Usage
use Pforret\PfArticleExtractor\ArticleExtractor;

$articleData = ArticleExtractor::getArticle($html);

/*
 * $articleData = Pforret\PfArticleExtractor\Formats\ArticleContents Object
 * (
 *     [title] => Film Podcast: Wicked Little Letters Named Film of the Month
 *     [content] => UK Film Club was back in March with a new episode of their film podcast. (...)
 *     [date] =>
 *     [images] => Array
 *         (
 *             [0] => https://static.wixstatic.com/media/.../b19cd0_dde0d59546f84127865267f43994f39b~mv2.jpg
 *         )
 *     [links] => Array
 *         (
 *             [0] => https://www.chrisolson.co.uk/
 *             (...)
 *         )
 * )
 */
Under the hood
- the package accepts a full HTML page as input
- it walks the DOM tree and tries to find the main article content
- it removes boilerplate content (headers, footers, sidebars, ...)
- it tries to extract the main article content
- it tries to extract the title, date, images and links from the article
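The steps above can be illustrated with a minimal sketch that uses only PHP's built-in DOM extension. This is not the package's actual implementation (which builds on fivefilters/readability.php): the function name sketchExtract(), the list of boilerplate tags, and the "container with the most paragraph text" heuristic are all assumptions made for illustration, and date extraction is left out.

```php
<?php
// Hypothetical sketch of the pipeline described above; NOT the package's code.
function sketchExtract(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely well-formed, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog hints libxml to treat the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Read the page title before any nodes are removed.
    $titleNode = $xpath->query('//title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    // Remove elements that usually hold boilerplate (assumed tag list).
    $boilerplate = '//script|//style|//nav|//header|//footer|//aside';
    foreach (iterator_to_array($xpath->query($boilerplate)) as $node) {
        $node->parentNode?->removeChild($node);
    }

    // Pick the container whose direct <p> children hold the most text.
    $best = null;
    $bestLength = 0;
    foreach ($xpath->query('//article|//main|//section|//div|//body') as $candidate) {
        $length = 0;
        foreach ($xpath->query('./p', $candidate) as $p) {
            $length += mb_strlen(trim($p->textContent));
        }
        if ($length > $bestLength) {
            $best = $candidate;
            $bestLength = $length;
        }
    }

    $paragraphs = [];
    $images = [];
    $links = [];
    if ($best !== null) {
        foreach ($xpath->query('.//p', $best) as $p) {
            // Collapse runs of whitespace inside each paragraph.
            $text = trim((string) preg_replace('/\s+/u', ' ', $p->textContent));
            if ($text !== '') {
                $paragraphs[] = $text;
            }
        }
        foreach ($xpath->query('.//img[@src]', $best) as $img) {
            $images[] = $img->getAttribute('src');
        }
        foreach ($xpath->query('.//a[@href]', $best) as $a) {
            $links[] = $a->getAttribute('href');
        }
    }

    return [
        'title'   => $title,
        'content' => implode("\n\n", $paragraphs),
        'images'  => array_values(array_unique($images)),
        'links'   => array_values(array_unique($links)),
    ];
}
```

Calling sketchExtract($html) returns an array with title, content, images and links keys, loosely mirroring the fields shown in the Usage output. A production extractor needs far more robust scoring than this; the sketch only shows the general shape of the approach.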
Right now it's tested with example pages for:
- Blogger
- Drupal
- Jekyll
- Mkdocs
- Wix
- WordPress
Similar packages
- beautifulsoup4 - Python, MIT
- html-text - Python, MIT
- kohlschutter/boilerpipe - Java, Apache 2.0
- fivefilters/readability.php - PHP, GPL-3.0
- miso-belica/jusText - Python, BSD-2-Clause
- codelucas/newspaper - Python, Apache