ahmaadkhader/pdf-to-html

Standalone PHP library for extracting semantic HTML from PDF files. Detects headings, lists, tables, links, and inline styles from PDF content.

Package info

github.com/Ahmaadkhader/pdf-to-html

pkg:composer/ahmaadkhader/pdf-to-html

1.0.0 2026-05-11 19:07 UTC

This package is auto-updated.

Last update: 2026-05-12 13:34:28 UTC


README

Standalone PHP library for extracting semantic HTML from PDF files using smalot/pdfparser.

Features

  • Heading detection — identifies heading levels from font size ratios
  • List detection — groups bullet and numbered items into <ul>/<ol> lists
  • Table detection — identifies tabular content from X-coordinate column clustering
  • Link extraction — matches PDF link annotations to text content
  • Inline styles — preserves font size and color differences as CSS
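
As an illustration of the heading-detection idea above, the following is a minimal sketch of mapping font size ratios to heading levels. It is not the library's implementation: the function names `bodyFontSize` and `headingLevelForRatio` and the ratio thresholds are hypothetical, chosen only to show the technique. The body size is assumed to be the most frequent font size in the document.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: estimate the body font size as the most
 * frequent size among all text lines.
 *
 * @param float[] $sizes
 */
function bodyFontSize(array $sizes): float
{
    $counts = [];
    foreach ($sizes as $size) {
        // Round to one decimal so near-identical sizes are counted together.
        $key = number_format($size, 1, '.', '');
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }
    arsort($counts);

    // An empty input yields 0.0, which headingLevelForRatio() treats as "unknown".
    return (float) array_key_first($counts);
}

/**
 * Hypothetical helper: return a heading level (1-4) for a line, or null
 * for body text, based on the ratio of its font size to the body size.
 * The thresholds are illustrative assumptions.
 */
function headingLevelForRatio(float $fontSize, float $bodySize): ?int
{
    if ($bodySize <= 0.0) {
        return null;
    }

    $ratio = $fontSize / $bodySize;

    // Ordered from largest to smallest ratio; the first match wins.
    $thresholds = [1 => 2.0, 2 => 1.6, 3 => 1.35, 4 => 1.15];
    foreach ($thresholds as $level => $minRatio) {
        if ($ratio >= $minRatio) {
            return $level;
        }
    }

    return null;
}
```

With a body size of 12, a 24-point line has a ratio of 2.0 and maps to level 1, while an 18-point line has a ratio of 1.5 and maps to level 3.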

Installation

composer require ahmaadkhader/pdf-to-html

Usage

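// Load Composer's autoloader (adjust the path to match your project layout).
require __DIR__ . '/vendor/autoload.php';

use Ahmaadkhader\PdfToHtml\PdfToHtml;
use Ahmaadkhader\PdfToHtml\StyleAnalyzer;
use Ahmaadkhader\PdfToHtml\TableDetector;
use Ahmaadkhader\PdfToHtml\LinkExtractor;
use Ahmaadkhader\PdfToHtml\HtmlRenderer;

// Build the sub-components. LinkExtractor and HtmlRenderer share the same StyleAnalyzer.
$styleAnalyzer = new StyleAnalyzer();
$tableDetector = new TableDetector();
$linkExtractor = new LinkExtractor($styleAnalyzer);
$htmlRenderer = new HtmlRenderer($styleAnalyzer);

// Wire everything into the converter.
$converter = new PdfToHtml($styleAnalyzer, $tableDetector, $linkExtractor, $htmlRenderer);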

// Extract plain text.
$text = $converter->extractText('/path/to/file.pdf');

// Extract semantic HTML with headings, lists, tables and links.
$html = $converter->extractHtml('/path/to/file.pdf');

// Use native heading tags (h1-h6) instead of class-based headings.
$html = $converter->extractHtml('/path/to/file.pdf', ['native_headings' => true]);
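
For converting several files at once, a small wrapper along the following lines may help. This is a sketch under stated assumptions: `convertDirectory` is a hypothetical helper that is not part of the package, and it takes the conversion step as a callable so that it relies only on the `extractHtml()` call shown above. The README does not document which exceptions the library throws, so the sketch catches `Throwable` broadly.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: convert every *.pdf file in $inputDir and write
 * one .html file per PDF into $outputDir.
 *
 * @param callable(string): string $extractHtml Receives a PDF path, returns HTML.
 * @return array<string, string> Map of PDF path => error message for failed files.
 */
function convertDirectory(callable $extractHtml, string $inputDir, string $outputDir): array
{
    // Create the output directory if needed (the second is_dir() guards against races).
    if (!is_dir($outputDir) && !mkdir($outputDir, 0775, true) && !is_dir($outputDir)) {
        throw new RuntimeException("Cannot create output directory: $outputDir");
    }

    $errors = [];
    foreach (glob(rtrim($inputDir, '/') . '/*.pdf') ?: [] as $pdfPath) {
        try {
            $html = $extractHtml($pdfPath);
            $target = $outputDir . '/' . basename($pdfPath, '.pdf') . '.html';
            file_put_contents($target, $html);
        } catch (Throwable $e) {
            // Record the failure and continue with the remaining files.
            $errors[$pdfPath] = $e->getMessage();
        }
    }

    return $errors;
}
```

Assuming `extractHtml()` returns the HTML as a string, the converter built above could be passed in as `fn (string $path): string => $converter->extractHtml($path)`.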

Architecture

Class           Responsibility
PdfToHtml       Core orchestration: parses the PDF and coordinates sub-components
StyleAnalyzer   Font size, color, heading level, and inline style detection
TableDetector   Table region detection from DataTm positioning data
LinkExtractor   PDF link annotation extraction and text matching
HtmlRenderer    Renders classified content lines into semantic HTML
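
To make the TableDetector row above more concrete, here is a minimal sketch of X-coordinate column clustering. It is not the library's code: `clusterColumns` and its `$tolerance` parameter are hypothetical, and the input is assumed to be the X positions of text fragments already taken from the positioning data. The approach shown is a simple one-dimensional gap-based clustering: sort the coordinates and start a new column whenever the gap to the previous value exceeds the tolerance.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: cluster X coordinates into column positions.
 *
 * @param float[] $xs        X positions of text fragments.
 * @param float   $tolerance Maximum gap between neighbours in one column (assumed value).
 * @return float[] Mean X position of each column, ordered left to right.
 */
function clusterColumns(array $xs, float $tolerance = 5.0): array
{
    if ($xs === []) {
        return [];
    }

    sort($xs);

    // Each cluster is a list of X values that lie close together.
    $clusters = [[$xs[0]]];
    for ($i = 1; $i < count($xs); $i++) {
        $last = count($clusters) - 1;
        $prev = end($clusters[$last]);

        if ($xs[$i] - $prev > $tolerance) {
            $clusters[] = [$xs[$i]];      // Gap is too large: start a new column.
        } else {
            $clusters[$last][] = $xs[$i]; // Close enough: same column.
        }
    }

    // Represent each column by the mean of its members.
    return array_map(
        fn (array $cluster): float => array_sum($cluster) / count($cluster),
        $clusters
    );
}
```

A detector built on this idea might then treat consecutive lines whose fragments fall into the same two or more columns as table rows; that step is left out here because the source does not describe it.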

Requirements

  • PHP 8.1+
  • smalot/pdfparser ^2.12

License

GPL-2.0-or-later