markuspoerschke / extractum
Extract information from web pages.
pkg:composer/markuspoerschke/extractum
Requires
- php: ^7.4 || ^8.0
- ext-dom: *
- ext-json: *
- ml/json-ld: ^1.2
- symfony/css-selector: ^5.1
- symfony/dom-crawler: ^5.1
- voku/stop-words: ^2.0
 
Requires (Dev)
- ergebnis/composer-normalize: ^2.11
- friendsofphp/php-cs-fixer: ^3.0
- phpmd/phpmd: ^2.9
- phpunit/phpunit: ^9.4
- symfony/finder: ^5.1
- symfony/var-dumper: ^5.2
- vimeo/psalm: ^4.1
 
- 1.x-dev
- 1.0.3
- 1.0.2
- 1.0.1
- 1.0.0
 
This package is auto-updated.
Last update: 2025-10-06 05:47:01 UTC
README
Extractum is a PHP library that extracts information from web pages.
Getting Started
Installation
composer require markuspoerschke/extractum
Usage
$uri = 'https://www.example.com/';
$html = file_get_contents($uri);
$extractor = new Extractum\Extractor();
$essence = $extractor->extract($html, $uri);
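Because file_get_contents() returns false when a download fails, it may be worth guarding the fetch before handing the HTML to the extractor. The following is a minimal sketch under stated assumptions: fetchHtml is a hypothetical helper that is not part of Extractum, and the commented usage assumes the package is installed and that vendor/autoload.php is the Composer autoloader.

```php
<?php

/**
 * Hypothetical helper (not part of Extractum): download a page or throw.
 */
function fetchHtml(string $uri): string
{
    // file_get_contents() returns false on failure. The @ operator suppresses
    // the PHP warning so the failure is reported through the exception instead.
    $html = @file_get_contents($uri);

    if ($html === false) {
        throw new RuntimeException(sprintf('Could not download "%s".', $uri));
    }

    return $html;
}

// Usage with the calls shown in this README (assumes the package is installed):
//
// require __DIR__ . '/vendor/autoload.php';
// $uri = 'https://www.example.com/';
// $essence = (new Extractum\Extractor())->extract(fetchHtml($uri), $uri);
```

Throwing early keeps a failed download from reaching the extractor as a boolean instead of a string.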
Extracted Information
The extracted information is returned as an object of type Extractum\Essence.
| Property | Description |
|---|---|
| date | The date when the web page was published. |
| description | Normally the meta description or any other excerpt. |
| image | The URL of the preview image. Normally defined as an Open Graph attribute. |
| language | The two-character language code of the HTML tag. |
| links | All links within the main content. |
| parsedDate | A DateTimeImmutable object if date could be parsed. |
| text | Unformatted text of the main content. All line breaks and unnecessary spaces are removed. |
| title | The web page's title. This is normally the content of the first h1 tag. |
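To make the table more concrete, here is a rough sketch of how a few of these fields could be read from a page using only ext-dom. It is an illustration, not Extractum's implementation: the function name sketchMetadata, the returned array shape, and the choice of the article:published_time meta tag as the date source are all assumptions made for this example.

```php
<?php

/**
 * Illustrative only, NOT Extractum's implementation. All names are hypothetical.
 *
 * @return array{title: ?string, description: ?string, language: ?string, parsedDate: ?DateTimeImmutable}
 */
function sketchMetadata(string $html): array
{
    $result = ['title' => null, 'description' => null, 'language' => null, 'parsedDate' => null];

    if (trim($html) === '') {
        return $result; // nothing to parse
    }

    $document = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser errors internally
    // instead of emitting warnings, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // XPath string() yields '' when nothing matches; map that to null.
    $read = static function (string $expression) use ($xpath): ?string {
        $value = trim((string) $xpath->evaluate('string(' . $expression . ')'));

        return $value === '' ? null : $value;
    };

    $result['title'] = $read('(//h1)[1]');
    $result['description'] = $read('//meta[@name="description"]/@content');
    $result['language'] = $read('/html/@lang');

    // Assumption: the page exposes its publication date via this meta tag.
    $date = $read('//meta[@property="article:published_time"]/@content');
    if ($date !== null) {
        try {
            $result['parsedDate'] = new DateTimeImmutable($date);
        } catch (Exception $exception) {
            $result['parsedDate'] = null; // the date string could not be parsed
        }
    }

    return $result;
}
```

Returning null for anything that cannot be found or parsed mirrors the idea that parsedDate depends on a usable date; whether Extractum behaves the same way for every property is not confirmed here.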
License
This package is released under the MIT license.