stevebauman/hypertext

The best HTML to text transformer

v1.1.0 2024-04-04 20:02 UTC

This package is auto-updated.

Last update: 2024-04-04 20:07:21 UTC


README

A PHP HTML to plain text transformer that gracefully handles varied and malformed HTML.

Hypertext extracts the text content from any HTML-based document and automatically:

  • Removes CSS
  • Removes scripts
  • Removes headers (the document <head>, such as <title> and <meta> tags)
  • Removes non-HTML content
  • Preserves spacing
  • Preserves links (optional)
  • Preserves new lines (optional)

It is intended to produce output for LLM-related tasks, such as prompts and embeddings.

Installation

composer require stevebauman/hypertext

Usage

use Stevebauman\Hypertext\Transformer;

$transformer = new Transformer();

// (Optional) Filter out specific elements by their XPath.
$transformer->filter("//*[@id='some-element']");

// (Optional) Retain new line characters.
$transformer->keepNewLines();

// (Optional) Retain anchor tags and their href attribute.
$transformer->keepLinks();

$text = $transformer->toText($html);
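
To see which elements an XPath expression such as the one passed to filter() would match, it can help to try it against a small document first. The sketch below uses only PHP's built-in DOM extension (DOMDocument and DOMXPath), not Hypertext itself. The sample markup is made up for illustration, and the removal step is an assumption about what "filter out" means, not a description of how Hypertext implements filter() internally.

```php
<?php

// Illustration only: plain PHP DOM extension, no Hypertext classes involved.
// The markup below is a made-up sample.
$html = '<div><p id="some-element">Cookie banner</p><p>Article text.</p></div>';

$dom = new DOMDocument();
// Suppress libxml warnings so imperfect HTML does not emit notices.
$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

$xpath = new DOMXPath($dom);

// Same expression as in the usage snippet above.
$matches = $xpath->query("//*[@id='some-element']");

// Record what the expression selected.
$selected = [];
foreach ($matches as $node) {
    $selected[] = $node->nodeName . ': ' . trim($node->textContent);
}

// Assumption: "filtering out" means removing the matched nodes.
foreach (iterator_to_array($matches) as $node) {
    $node->parentNode->removeChild($node);
}

$remaining = trim($dom->documentElement->textContent);

print_r($selected);
echo $remaining, PHP_EOL;
```

Checking an expression this way before handing it to the transformer makes it easier to confirm that it targets only the elements you intend to drop.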

Example

For larger examples, please view the tests/Fixtures directory.

Input:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <p>This is a paragraph of text on my webpage.</p>
    <a href="https://blog.com/posts">Click here</a> to view my posts.
</body>
</html>

Output (Pure Text):

echo (new Transformer)->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. Click here to view my posts.

Output (Keep New Lines):

echo (new Transformer)->keepNewLines()->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
Click here to view my posts.

Output (Keep Links):

echo (new Transformer)->keepLinks()->toText($html);
Welcome to My Blog This is a paragraph of text on my webpage. <a href="https://blog.com/posts">Click here</a> to view my posts.

Output (Keep Both):

echo (new Transformer)
    ->keepLinks()
    ->keepNewLines()
    ->toText($html);
Welcome to My Blog
This is a paragraph of text on my webpage.
<a href="https://blog.com/posts">Click here</a> to view my posts.