iamgerwin / php-pdf-to-markdown-parser
A lightweight PHP library to convert PDF documents into clean, structured Markdown. Supports text extraction, headings, lists, tables, diagrams and code blocks for easier content reuse and publishing.
Requires
- php: ^8.3
- smalot/pdfparser: ^2.9
Requires (Dev)
- laravel/pint: ^1.13
- mockery/mockery: ^1.6
- pestphp/pest: ^2.34
- phpstan/phpstan: ^1.10
- symfony/var-dumper: ^7.0
README
Because sometimes PDFs just need to chill out and become Markdown.
Features
- 📝 Text Extraction with Styling - Preserves headings, bold, italic, and strikethrough formatting
- 📊 Table Parsing - Extracts tables with proper headers and body formatting
- 🎨 Diagram Support - Converts diagrams to Mermaid and dbdiagram.io formats
  - Flowcharts
  - Sequence diagrams
  - Entity Relationship Diagrams (ERD)
  - Gantt charts
  - Class diagrams
  - State diagrams
  - Pie charts
- 📋 List Detection - Automatically converts bullet points and numbered lists
- 💻 Code Block Recognition - Identifies and formats code snippets
- 🚀 PHP 8.3 Compatible - Built with modern PHP features
- ✅ PSR-12 Compliant - Follows PHP coding standards
Installation
You can install the package via composer:
```bash
composer require iamgerwin/php-pdf-to-markdown-parser
```
Usage
Basic Usage
```php
use Iamgerwin\PdfToMarkdownParser\PdfToMarkdownParser;

$parser = new PdfToMarkdownParser();

// Parse a PDF file
$markdown = $parser->parseFile('path/to/document.pdf');

// Parse PDF content
$pdfContent = file_get_contents('path/to/document.pdf');
$markdown = $parser->parseContent($pdfContent);

// Output the markdown
echo $markdown;
```
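To convert several files at once, the `parseFile` call above can be wrapped in a small loop. The following is a minimal sketch, not part of the library: `convertDirectory` is a hypothetical helper, and it takes the parsing step as a callable so the snippet runs on its own.

```php
/**
 * Convert every .pdf file in $dir to a .md file beside it.
 *
 * $parseFile is any callable that maps a PDF path to a Markdown string.
 * This helper is hypothetical and not provided by the package.
 *
 * @return list<string> Paths of the Markdown files written.
 */
function convertDirectory(string $dir, callable $parseFile): array
{
    $written = [];
    $pdfs = glob(rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . '*.pdf') ?: [];

    foreach ($pdfs as $pdfPath) {
        $markdown = $parseFile($pdfPath);

        // Swap the 4-character ".pdf" suffix for ".md", keeping the base name.
        $mdPath = substr($pdfPath, 0, -4) . '.md';
        file_put_contents($mdPath, $markdown);
        $written[] = $mdPath;
    }

    return $written;
}
```

With the package installed, the callable could be passed as `$parser->parseFile(...)` using PHP's first-class callable syntax.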
Working with Tables
The parser automatically detects and converts tables in your PDF:

| Header 1 | Header 2 | Header 3 |
| --- | --- | --- |
| Row 1 Col 1 | Row 1 Col 2 | Row 1 Col 3 |
| Row 2 Col 1 | Row 2 Col 2 | Row 2 Col 3 |
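
For readers curious how rows of cells become a table like the one above, here is a minimal sketch of the rendering step. `rowsToMarkdownTable` is a hypothetical function written for illustration. It is not the package's `TableExtractor` and makes no claim about how the library does this internally.

```php
/**
 * Render rows of cells as a Markdown table; the first row is the header.
 * Hypothetical helper for illustration only.
 *
 * @param list<list<string>> $rows
 */
function rowsToMarkdownTable(array $rows): string
{
    if ($rows === []) {
        return '';
    }

    // Pad short rows so every row has the same number of columns.
    $columns = max(array_map('count', $rows));
    $formatRow = static function (array $cells) use ($columns): string {
        $cells = array_pad($cells, $columns, '');
        // Escape pipes so cell text cannot break the table layout.
        $cells = array_map(
            static fn (string $cell): string => str_replace('|', '\\|', trim($cell)),
            $cells
        );

        return '| ' . implode(' | ', $cells) . ' |';
    };

    $lines = [$formatRow($rows[0])];
    // Separator row between the header and the body.
    $lines[] = '| ' . implode(' | ', array_fill(0, $columns, '---')) . ' |';
    foreach (array_slice($rows, 1) as $row) {
        $lines[] = $formatRow($row);
    }

    return implode("\n", $lines);
}
```

Padding short rows keeps the output valid even when extraction drops a trailing cell, which is a common failure mode with PDF text.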
Diagram Extraction
Diagrams are automatically detected and converted to appropriate formats:
**Mermaid Flowcharts:**
```mermaid
flowchart TD
    Start --> Process --> End
```
**ERD (dbdiagram.io format):**
```dbdiagram
Table users {
  id int
  name varchar
  email varchar
}
```
**Sequence Diagrams:**
```mermaid
sequenceDiagram
    User->>System: Request
    System->>Database: Query
    Database->>System: Response
    System->>User: Result
```
### Text Styling
The parser preserves text styling from PDFs:
- Headings (H1-H6) based on font size and formatting
- **Bold text**
- *Italic text*
- ~~Strikethrough text~~
- Lists (bulleted and numbered)
- Code blocks
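
As an illustration of heading detection by font size, the sketch below maps a text run's size, relative to the body size, to a Markdown heading level. The function names and the ratio thresholds are assumptions chosen for the example. The library's actual heuristics are not documented here and may differ.

```php
/**
 * Map a font size to a Markdown heading level (1-6), or null for body text.
 * The ratio thresholds are illustrative guesses, not the library's values.
 */
function headingLevel(float $fontSize, float $bodySize): ?int
{
    if ($bodySize <= 0.0) {
        throw new InvalidArgumentException('Body font size must be positive.');
    }

    $ratio = $fontSize / $bodySize;

    // Largest ratios first, so the first match wins.
    $thresholds = [1 => 2.0, 2 => 1.7, 3 => 1.5, 4 => 1.3, 5 => 1.2, 6 => 1.1];
    foreach ($thresholds as $level => $minRatio) {
        if ($ratio >= $minRatio) {
            return $level;
        }
    }

    return null;
}

/** Prefix the text with "#" markers when its size suggests a heading. */
function toMarkdownLine(string $text, float $fontSize, float $bodySize = 12.0): string
{
    $level = headingLevel($fontSize, $bodySize);

    return $level === null ? $text : str_repeat('#', $level) . ' ' . $text;
}
```

Working with a ratio rather than absolute point sizes lets the same thresholds apply to documents set in different body sizes.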
## Advanced Configuration
### Custom Extractors
You can extend the parser with custom extractors:
```php
use Iamgerwin\PdfToMarkdownParser\PdfToMarkdownParser;
use Iamgerwin\PdfToMarkdownParser\Extractors\TextExtractor;
use Iamgerwin\PdfToMarkdownParser\Extractors\TableExtractor;
use Iamgerwin\PdfToMarkdownParser\Extractors\DiagramExtractor;

$parser = new PdfToMarkdownParser();

// The parser uses these extractors internally:
// - TextExtractor: Handles text and styling
// - TableExtractor: Processes tables
// - DiagramExtractor: Converts diagrams
```
Testing
Run the test suite:
```bash
composer test
```
Run tests with coverage:
```bash
composer test-coverage
```
Run PHPStan static analysis:
```bash
composer analyse
```
Format code with Laravel Pint:
```bash
composer format
```
Requirements
- PHP 8.3 or higher
- ext-mbstring
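
The two requirements above can be checked at runtime before using the parser. This is a small sketch using only PHP built-ins. `unmetRequirements` is a hypothetical helper name, not something the package ships.

```php
/**
 * Return a list of unmet requirements; an empty list means the
 * environment satisfies both items in the Requirements section.
 *
 * @return list<string>
 */
function unmetRequirements(): array
{
    $problems = [];

    if (version_compare(PHP_VERSION, '8.3.0', '<')) {
        $problems[] = 'PHP 8.3 or higher is required, found ' . PHP_VERSION;
    }

    if (!extension_loaded('mbstring')) {
        $problems[] = 'The mbstring extension (ext-mbstring) is not loaded';
    }

    return $problems;
}

// Print one line per problem; prints nothing when everything is in place.
foreach (unmetRequirements() as $problem) {
    echo $problem, PHP_EOL;
}
```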
How It Works
The parser uses a multi-stage extraction process:
- PDF Parsing - Uses the robust smalot/pdfparser library to extract raw content
- Text Analysis - Identifies text styling, headings, and formatting patterns
- Table Detection - Recognizes table structures (pipe, tab, or space-separated)
- Diagram Recognition - Detects diagram patterns and converts to Mermaid/dbdiagram formats
- Markdown Generation - Combines all elements into properly formatted Markdown
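
The table detection step mentions pipe, tab, and space-separated structures. A rough sketch of how one line of extracted text might be split into cells follows. `splitTableRow` and its delimiter order are assumptions made for this example. The package's own detection logic may work differently.

```php
/**
 * Split one line of extracted text into table cells.
 * Tries pipes first, then tabs, then runs of two or more spaces.
 * Returns null when the line does not look like a table row.
 * Hypothetical helper for illustration only.
 *
 * @return list<string>|null
 */
function splitTableRow(string $line): ?array
{
    $line = trim($line);
    if ($line === '') {
        return null;
    }

    if (str_contains($line, '|')) {
        // Drop the optional leading and trailing pipe before splitting.
        $cells = explode('|', trim($line, '|'));
    } elseif (str_contains($line, "\t")) {
        $cells = explode("\t", $line);
    } else {
        // Single spaces separate words; two or more suggest column gaps.
        $cells = preg_split('/ {2,}/', $line);
    }

    $cells = array_map('trim', $cells);

    // A single cell means no delimiter was found: treat as plain text.
    return count($cells) >= 2 ? $cells : null;
}
```

Requiring at least two spaces for the space-separated case keeps ordinary sentences from being mistaken for rows, at the cost of missing tightly set columns.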
Limitations
- Images: Images are not extracted yet (planned for a future version)
- Complex Layouts: Multi-column layouts may require manual adjustment
- Font Styling: Bold and italic detection is basic because font metadata parsing is limited
- Diagrams: Pattern matching may not catch all diagram types
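
To make the diagram limitation concrete, here is a minimal sketch of pattern-based detection: it recognizes a simple arrow chain and emits a Mermaid flowchart, and returns nothing for any text that does not fit the pattern. `arrowChainToMermaid` and the arrow forms it accepts are assumptions for illustration, not the package's `DiagramExtractor`.

```php
/**
 * Convert an arrow chain such as "Start -> Process -> End" into a
 * Mermaid flowchart. Returns null when no arrow chain is found, which is
 * how a pattern-based approach ends up missing other diagram types.
 * Hypothetical helper for illustration only.
 */
function arrowChainToMermaid(string $text): ?string
{
    // Accept "->", "-->" and "=>" as arrows between node names.
    $nodes = preg_split('/\s*(?:-{1,2}>|=>)\s*/', trim($text));
    $nodes = array_values(
        array_filter($nodes, static fn (string $n): bool => $n !== '')
    );

    if (count($nodes) < 2) {
        return null;
    }

    // Derive an id without spaces and keep the original text as the label.
    $parts = array_map(
        static fn (string $n): string =>
            preg_replace('/\W+/', '_', $n) . '["' . str_replace('"', "'", $n) . '"]',
        $nodes
    );

    return "flowchart TD\n    " . implode(' --> ', $parts);
}
```

Anything drawn with shapes and connectors rather than text arrows would not match this pattern at all, so output for such pages usually needs manual review.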
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Security
If you discover any security-related issues, please email iamgerwin@live.com instead of using the issue tracker.
License
The MIT License (MIT). Please see License File for more information.
Acknowledgments
Built with inspiration from the PHP community and the need to make PDF content more accessible and reusable. Special thanks to the maintainers of smalot/pdfparser for their excellent PDF parsing library.