zackslash / php-web-article-extractor
Web Article Extractor
This package's canonical repository appears to be gone and the package has been frozen as a result.
Requires
- php: >=5.5.0
Requires (Dev)
- phpunit/phpunit: 3.7.*
This package is not auto-updated.
Last update: 2021-06-23 08:41:20 UTC
README
Web Article Extractor is a PHP library that detects the primary 'article' content of a web page and strips the surrounding 'clutter' to give you the clean article. It also derives information from the article that can be used for indexing, such as its language and keywords.
Features
- Extracts a clean article and headline from a web page quickly.
- Identifies the language of the extracted article.
- Identifies keywords for the extracted article.
- Designed to easily integrate into pipeline or microservice project architectures.
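To give a feel for how 'clutter' can be told apart from article text, here is a minimal sketch of two shallow text features from the boilerplate detection paper credited under Acknowledgements: word count and link density. It is not this library's code. The function names, the regex-based anchor matching, and the thresholds in looksLikeContent are assumptions made for illustration, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical sketch, not part of Web Article Extractor.
// Computes two shallow text features for a single HTML block.
function shallowTextFeatures($blockHtml)
{
    // Count whitespace-separated words after stripping tags and entities.
    $countWords = function ($html) {
        $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
        return count(preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY));
    };

    $words = $countWords($blockHtml);

    // Words that sit inside <a>...</a> elements.
    $linkWords = 0;
    if (preg_match_all('/<a\b[^>]*>(.*?)<\/a>/is', $blockHtml, $matches)) {
        foreach ($matches[1] as $anchorHtml) {
            $linkWords += $countWords($anchorHtml);
        }
    }

    return array(
        'words' => $words,
        // Share of the block's words that are link text.
        'linkDensity' => $words > 0 ? $linkWords / $words : 0.0,
    );
}

// Assumed thresholds, chosen only for illustration: long blocks with few
// links are treated as article content.
function looksLikeContent(array $features, $minWords = 15, $maxLinkDensity = 0.33)
{
    return $features['words'] >= $minWords
        && $features['linkDensity'] <= $maxLinkDensity;
}
```

Navigation bars and footers tend to be short and made up mostly of links, so they score a high link density and a low word count, while article paragraphs score the opposite.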
Usage
There are two ways to use Web Article Extractor. The first is to use the provided Dockerfile (see 'Installation'), which creates an instance that you can start using straight away and is ideal for pipeline architectures. The second is to add the PHP library to your project through Composer.
Installation
Docker
To build with Docker, execute the build command inside this project's root directory:
$ docker build -t zackslash/web-article-extractor .
You should now be able to run the article extractor script with the following command:
$ docker run zackslash/web-article-extractor <URL>
Example:
$ docker run zackslash/web-article-extractor http://uk.ign.com/articles/2015/03/19/gabe-newell-discusses-possibility-of-half-life-3
Composer
The first step to using Web Article Extractor in PHP is to download Composer:
$ curl -sS https://getcomposer.org/installer | php
Now add PHP Web Article Extractor to your project with Composer:
$ php composer.phar require zackslash/php-web-article-extractor
And that's it! Composer will automatically handle the rest.
Alternatively, you can manually add the dependency to your composer.json file...
{
    "require": {
        "zackslash/php-web-article-extractor": "*"
    }
}
... and then install the dependencies using:
$ php composer.phar install
Simple PHP example:
<?php
// This file is generated by Composer
require_once 'vendor/autoload.php';

// Extract article directly from a URL
$extractionResult = WebArticleExtractor\Extract::extractFromURL('http://uk.ign.com/articles/2015/03/19/gabe-newell-discusses-possibility-of-half-life-3');

// Display the extracted article in JSON form
echo json_encode($extractionResult);
?>
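For pipeline use you will often want to process several URLs in one run. The following is a minimal sketch of a wrapper that does this; extractToJsonLines is a hypothetical helper, not part of the library. It takes the extractor as a callable so it can be tried without network access, and it assumes (without confirmation from the library) that a failed extraction may throw an exception.

```php
<?php
// Hypothetical helper, not part of Web Article Extractor.
// Runs $extractor over each URL and returns one JSON string per valid URL.
function extractToJsonLines(array $urls, callable $extractor)
{
    $lines = array();
    foreach ($urls as $url) {
        // Skip anything that is not a syntactically valid URL.
        if (filter_var($url, FILTER_VALIDATE_URL) === false) {
            continue;
        }
        try {
            $result = call_user_func($extractor, $url);
        } catch (Exception $e) {
            // Assumption: failures surface as exceptions; record them per URL.
            $result = array('error' => $e->getMessage());
        }
        $lines[] = json_encode(array('url' => $url, 'result' => $result));
    }
    return $lines;
}

// With the library installed through Composer, the call would look like:
// require_once 'vendor/autoload.php';
// $lines = extractToJsonLines($urls, 'WebArticleExtractor\Extract::extractFromURL');
// echo implode("\n", $lines), "\n";
```

Emitting one JSON document per line keeps each result independent, so a downstream stage can consume the output line by line.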
Requirements
- PHP >= 5.5.0
Running the Tests (Optional)
To run the unit tests, you'll need to install PHPUnit. Once it is installed, launch the following command inside this library's 'build' directory:
$ phpunit
Acknowledgements
Parts of PHP Web Article Extractor are based on algorithms from the whitepaper 'Boilerplate Detection using Shallow Text Features' and on 'Boilerpipe' by Christian Kohlschuetter, Peter Fankhauser and Wolfgang Nejdl.
PHP Web Article Extractor implements the Rapid Automatic Keyword Extraction (RAKE) algorithm as described in the book 'Text Mining: Applications and Theory'. The implementation is based on aneesha's open source Python version.
The stop word dictionary used in this project was pulled from Peter Graham's 'stopwords' repository.
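As an illustration of how RAKE combines a stop word list with word co-occurrence, here is a minimal sketch of the algorithm. It is not the library's implementation: rakeKeywords is a hypothetical function, the tokenisation regex is an assumption, and lowercasing with strtolower only handles ASCII letters reliably.

```php
<?php
// Hypothetical sketch of RAKE, not the library's implementation.
// Returns candidate phrases mapped to scores, highest score first.
function rakeKeywords($text, array $stopWords)
{
    $stop = array_flip(array_map('strtolower', $stopWords));

    // 1. Split into candidate phrases at punctuation and at stop words.
    $phrases = array();
    foreach (preg_split('/[^\p{L}\p{N}\s\'-]+/u', strtolower($text)) as $fragment) {
        $current = array();
        foreach (preg_split('/\s+/u', $fragment, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            if (isset($stop[$word])) {
                if ($current) {
                    $phrases[] = $current;
                }
                $current = array();
            } else {
                $current[] = $word;
            }
        }
        if ($current) {
            $phrases[] = $current;
        }
    }

    // 2. Word frequency, and degree (summed length of the phrases a word is in).
    $freq = array();
    $degree = array();
    foreach ($phrases as $words) {
        foreach ($words as $word) {
            $freq[$word] = (isset($freq[$word]) ? $freq[$word] : 0) + 1;
            $degree[$word] = (isset($degree[$word]) ? $degree[$word] : 0) + count($words);
        }
    }

    // 3. Phrase score is the sum of degree/frequency over its words.
    $scores = array();
    foreach ($phrases as $words) {
        $score = 0.0;
        foreach ($words as $word) {
            $score += $degree[$word] / $freq[$word];
        }
        $scores[implode(' ', $words)] = $score;
    }
    arsort($scores);
    return $scores;
}
```

Because a word's degree grows with the length of the phrases it appears in, multi-word phrases tend to outrank isolated words.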
License
PHP Web Article Extractor is released under the MIT License. See the bundled LICENSE file for details.