phikhi/url-to-text

Extract text from a URL

v1.0.5 2023-03-02 15:35 UTC

This package is auto-updated.

Last update: 2024-05-30 00:42:30 UTC


README

Extract text from a remote HTML page.

🚧 WORK IN PROGRESS (do not use) 🚧

Installation

composer require phikhi/url-to-text

Usage

Basic usage

use Phikhi\UrlToText\UrlToText;

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toArray();
/*
[
    'lorem ipsum dolor sit amet',
    'non gloriam sine audentes',
    '...'
];
*/

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toJson();
// ["lorem ipsum dolor sit amet", "non gloriam sine audentes", "..."]

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->extract()
    ->toText();
/*
lorem ipsum dolor sit amet
non gloriam sine audentes
...
*/
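
The three output methods appear to differ only in how the same list of extracted strings is serialized. The sketch below illustrates that relationship in plain PHP. It assumes (the README does not confirm this) that toJson() behaves like json_encode() on the array and that toText() joins the entries with newlines; $lines is a hypothetical stand-in for the result of toArray().

```php
<?php
// Hypothetical stand-in for the array returned by toArray().
$lines = ['lorem ipsum dolor sit amet', 'non gloriam sine audentes'];

// Assumed equivalent of toJson(): a JSON-encoded list of strings.
$json = json_encode($lines, JSON_THROW_ON_ERROR);

// Assumed equivalent of toText(): one extracted string per line.
$text = implode("\n", $lines);

echo $json, PHP_EOL;
echo $text, PHP_EOL;
```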

Advanced usage

You can customize which tags are parsed:

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->allow(['div', 'span']) // will add these tags to the existing allowed tags array (H*, p, li, a).
    ->extract()
    ->toArray();

To replace the allowed tags array instead of extending it, pass overwrite: true as the second argument to the allow() method (the named-argument syntax requires PHP 8.0 or later):

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->allow(['div', 'span'], overwrite: true) // will replace the existing allowed tags array with this one.
    ->extract()
    ->toArray();
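
The difference between extending and overwriting can be sketched as a small list merge. This is only an illustration: mergeTags() is a hypothetical helper, not part of the package, and the default list is an assumption taken from the comment above, with "H*" expanded to h1 through h6.

```php
<?php
// Hypothetical helper, not part of the package: extends or replaces a tag list.
function mergeTags(array $current, array $tags, bool $overwrite = false): array
{
    if ($overwrite) {
        // Overwrite: discard the current list and keep only the new tags.
        return array_values(array_unique($tags));
    }

    // Extend: keep the current tags, append the new ones, drop duplicates.
    return array_values(array_unique(array_merge($current, $tags)));
}

// Assumed defaults, based on the "(H*, p, li, a)" comment in the README.
$defaults = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'li', 'a'];

$extended = mergeTags($defaults, ['div', 'span']);
$replaced = mergeTags($defaults, ['div', 'span'], overwrite: true);

echo implode(', ', $extended), PHP_EOL; // h1, h2, h3, h4, h5, h6, p, li, a, div, span
echo implode(', ', $replaced), PHP_EOL; // div, span
```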

By default, script and style tags are stripped from the DOM before the allowed tags are extracted, because their contents can otherwise cause unexpected results during extraction. You can extend this list of denied tags with the deny() method:

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->deny(['svg']) // will add the `svg` tag to the existing denied tags array (script, style).
    ->extract()
    ->toArray();
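
To make the order of operations concrete (strip denied tags first, then read the allowed ones), here is a minimal sketch built on PHP's bundled DOM extension. It is not the package's implementation: extractText() and the sample HTML are hypothetical, and the real library may parse and traverse the page differently.

```php
<?php
// Hypothetical helper, not the package's code: strips denied tags, then
// collects the trimmed text of each allowed tag in document order.
function extractText(string $html, array $allowed, array $denied): array
{
    if ($allowed === []) {
        return [];
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Remove denied elements first so their contents never reach the output.
    foreach ($denied as $tag) {
        // Snapshot the matches before mutating the tree.
        foreach (iterator_to_array($xpath->query('//' . $tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    // One XPath union over all allowed tag names, e.g. "//h1 | //p".
    $query = '//' . implode(' | //', $allowed);

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {
            $texts[] = $text;
        }
    }

    return $texts;
}

$html = '<html><body><h1>Title</h1><script>var x = 1;</script>'
      . '<p>Hello <svg><text>icon</text></svg>world</p></body></html>';

$result = extractText($html, ['h1', 'p'], ['script', 'style', 'svg']);
print_r($result);
```

In this sketch, nested allowed tags (for example an a inside a p) would contribute their text twice, once per matching element; whether the package deduplicates such cases is not stated in the README.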

To replace the denied tags array instead of extending it, pass overwrite: true as the second argument to the deny() method:

$text = (new UrlToText())
    ->from('https://phikhi.com')
    ->deny(['svg'], overwrite: true) // will replace the existing denied tags array with this one.
    ->extract()
    ->toArray();