phikhi / url-to-text
Extract text from a URL
v1.0.5
2023-03-02 15:35 UTC
Requires
- php: ^8.1
Requires (Dev)
- laravel/pint: ^1.6.0
- nunomaduro/collision: ^7.0.5
- pestphp/pest: ^2.0.0
- pestphp/pest-plugin-mock: ^2.0.0
- phpstan/phpstan: ^1.10.3
- rector/rector: ^0.14.8
- symfony/var-dumper: ^6.2.7
README
Extract any text from a remote HTML page
🚧 WORK IN PROGRESS (do not use) 🚧
Installation
composer require phikhi/url-to-text
Usage
Basic usage
use Phikhi\UrlToText\UrlToText;

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toArray();
/*
[
    'lorem ipsum dolor sit amet',
    'non gloriam sine audentes',
    '...'
];
*/

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toJson();
// ["lorem ipsum dolor sit amet","non gloriam sine audentes","..."]

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toText();
/*
lorem ipsum dolor sit amet
non gloriam sine audentes
...
*/
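The three output methods read as different serializations of the same list of extracted strings. A minimal sketch of that relationship, assuming toJson() behaves like json_encode() over the array and toText() joins the entries with newlines. Neither assumption is confirmed by this README, and the variable names are illustrative only:

```php
<?php

// Stand-in for the array returned by ->toArray(); the values are illustrative.
$lines = ['lorem ipsum dolor sit amet', 'non gloriam sine audentes'];

// Assumed equivalent of ->toJson(): a JSON array of strings.
$json = json_encode($lines, JSON_THROW_ON_ERROR);

// Assumed equivalent of ->toText(): one extracted string per line.
$text = implode("\n", $lines);

echo $json, "\n";
echo $text, "\n";
```

If the library follows this pattern, toArray() is the most convenient form for further processing in PHP, while toJson() and toText() suit API responses and plain-text storage respectively.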
Advanced usage
You can customize which tags are parsed with the allow() method:
$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->allow(['div', 'span']) // adds these tags to the existing allowed tags array (H*, p, li, a)
    ->extract()
    ->toArray();
To overwrite the allowed tags array instead of extending it, pass overwrite: true as the second parameter to the allow() method:
$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->allow(['div', 'span'], overwrite: true) // replaces the existing allowed tags array with this one
    ->extract()
    ->toArray();
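The difference between extending and overwriting can be sketched with plain arrays. This is not the library's implementation: resolveTags() is a hypothetical helper, and the default list is taken from the comment in the first allow() example, with H* expanded to h1 through h6 as an assumption.

```php
<?php

/**
 * Hypothetical helper: combine a default tag list with user-supplied tags.
 *
 * @param list<string> $defaults tags allowed before customization
 * @param list<string> $tags     tags supplied by the caller
 * @return list<string>
 */
function resolveTags(array $defaults, array $tags, bool $overwrite = false): array
{
    if ($overwrite) {
        // Overwrite mode: the defaults are discarded entirely.
        return array_values(array_unique($tags));
    }

    // Extend mode: keep the defaults and append any tags not already present.
    return array_values(array_unique(array_merge($defaults, $tags)));
}

// Assumed defaults (H* expanded to h1..h6).
$defaults = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li', 'a'];

$extended = resolveTags($defaults, ['div', 'span']);
$replaced = resolveTags($defaults, ['div', 'span'], overwrite: true);

echo implode(', ', $extended), "\n"; // h1, h2, h3, h4, h5, h6, p, li, a, div, span
echo implode(', ', $replaced), "\n"; // div, span
```

Under these assumptions, overwrite mode is the one to reach for when only a narrow set of elements matters, since headings, paragraphs, list items, and links are no longer collected.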
By default, script and style tags are stripped before the allowed tags are extracted from the DOM, so that their contents do not end up in the extracted text. You can customize the denied tags with the deny() method:
$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->deny(['svg']) // adds the `svg` tag to the existing denied tags array (script, style)
    ->extract()
    ->toArray();
To overwrite the denied tags array instead of extending it, pass overwrite: true as the second parameter to the deny() method:
$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->deny(['svg'], overwrite: true) // replaces the existing denied tags array with this one
    ->extract()
    ->toArray();
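To illustrate why denied tags are stripped before extraction, here is a rough sketch of the deny-then-allow order using PHP's built-in DOM extension. It is an assumption about the general technique, not the package's actual code; extractText() and the sample HTML are hypothetical.

```php
<?php

/**
 * Hypothetical helper: drop denied elements, then collect text from allowed ones.
 *
 * @param list<string> $allowed tag names whose text should be collected
 * @param list<string> $denied  tag names removed before collection
 * @return list<string>
 */
function extractText(string $html, array $allowed, array $denied): array
{
    if ($allowed === []) {
        return [];
    }

    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    // Remove denied elements first so their contents never reach the output.
    foreach ($denied as $tag) {
        foreach (iterator_to_array($xpath->query('//' . $tag)) as $node) {
            $node->parentNode?->removeChild($node);
        }
    }

    // A union query returns the matching elements in document order.
    $query = implode(' | ', array_map(fn (string $tag): string => '//' . $tag, $allowed));

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $texts[] = $text;
        }
    }

    return $texts;
}

$html = '<html><head><style>p { color: red; }</style></head>'
    . '<body><h1>Title</h1><p>Hello<script>var x = 1;</script> world</p></body></html>';

$result = extractText($html, ['h1', 'p'], ['script', 'style']);
print_r($result);
```

In this sketch the inline script sits inside the paragraph, so skipping the deny step would mix its source code into the paragraph's text. Removing denied elements first keeps only the readable content.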