README

This package requires the following system tools:

qpdf: Used to decrypt PDFs and remove security restrictions before OCR processing
- Install on macOS: brew install qpdf
- Install on Ubuntu/Debian: sudo apt-get install qpdf
ocrmypdf: Used for OCR processing of scanned PDFs
- Install on macOS: brew install ocrmypdf
- Install on Ubuntu/Debian: sudo apt-get install ocrmypdf
python3 with pdfplumber: Used for text extraction from OCR'd PDFs
- Install: pip3 install pdfplumber

Usage

The package can be used standalone:

use FnbStatementParser\FnbStatementParser;

$parser = new FnbStatementParser();
$result = $parser->process('/path/to/statement.pdf');

// Access transactions and validation results, or export to CSV
$transactions = $result->transactions;
$validation = $result->validation;
$csv = $result->toCsv();

For example, a Laravel app can call this package from a BankStatement model's processPdf() method. All processing logic lives in the standalone package, which keeps it reusable and framework-agnostic.

Testing

Running Tests

./vendor/bin/pest

Testing Methodology

When testing new PDF files or debugging parsing issues, use this methodology:

1. Basic Test Script

Create a test script to analyze parsing results:

<?php
require __DIR__ . '/vendor/autoload.php';
use FnbStatementParser\FnbStatementParser;

$pdfPath = __DIR__ . '/path/to/statement.pdf';
$parser = new FnbStatementParser();
$result = $parser->process($pdfPath);

echo "Total transactions: " . count($result->transactions) . "\n";
echo "Expected credits: " . ($result->validation->expectedCreditCount ?? 'N/A') . "\n";
echo "Actual credits: " . $result->validation->actualCreditCount . "\n";
echo "Expected debits: " . ($result->validation->expectedDebitCount ?? 'N/A') . "\n";
echo "Actual debits: " . $result->validation->actualDebitCount . "\n";

// Show credit transactions
$credits = array_filter($result->transactions, fn($t) => $t->type === 'credit');
foreach ($credits as $credit) {
    echo $credit->date->format('d M Y') . " - " . $credit->description . " - " . number_format($credit->amount, 2) . "\n";
}

2. Analyzing Extracted Text

To debug OCR or parsing issues, save and examine the extracted text:

$result = $parser->process($pdfPath);
file_put_contents('extracted_text.txt', $result->extractedText);

// Then search for specific transactions
$lines = explode("\n", $result->extractedText);
foreach ($lines as $lineNum => $line) {
    if (preg_match('/pattern/i', $line)) {
        echo "Line " . ($lineNum + 1) . ": " . $line . "\n";
    }
}

3. Common Issues to Check

OCR Quality Issues:
- Check for corrupted dates (e.g., "2211 JJuunn" instead of "21 Jun")
- Look for spaces in decimal numbers (e.g., "9,844. 46Cr" instead of "9,844.46Cr")
- Check for spaces before commas (e.g., "48 ,969.15Cr" instead of "48,969.15Cr")
- Check for digit recognition errors (e.g., "929.20" instead of "529.20")

Parsing Issues:
- Verify transactions are on single lines (not split across multiple lines)
- Check that amounts are being extracted correctly
- Verify credit/debit classification is correct

Validation:
- Compare expected vs actual transaction counts
- Verify all credit transactions are identified
- Check for misclassified transactions (credits as debits or vice versa)
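
The spacing and doubled-character artifacts listed above follow regular patterns, so they can be normalized before inspecting the extracted text by hand. The sketch below is illustrative only: normalizeOcrAmount() and undoubleIfFullyDoubled() are hypothetical helpers, not functions provided by this package, and the regular expressions assume artifacts shaped like the examples above.

```php
<?php
// Hypothetical helper (not part of the package): removes stray whitespace
// around a comma or decimal point that sits between two digits.
function normalizeOcrAmount(string $raw): string
{
    return preg_replace('/(?<=\d)\s*([.,])\s*(?=\d)/', '$1', $raw) ?? $raw;
}

// Hypothetical helper: collapses doubled glyphs ("2211 JJuunn" -> "21 Jun"),
// but only when every whitespace-separated token is fully doubled.
function undoubleIfFullyDoubled(string $text): string
{
    $tokens = preg_split('/\s+/', trim($text));
    foreach ($tokens as $token) {
        if (!preg_match('/^(?:(.)\1)+$/u', $token)) {
            return $text; // at least one token is not doubled: leave as-is
        }
    }

    $collapsed = array_map(
        fn (string $t): string => preg_replace('/(.)\1/u', '$1', $t) ?? $t,
        $tokens
    );

    return implode(' ', $collapsed);
}

echo normalizeOcrAmount('9,844. 46Cr') . PHP_EOL;      // 9,844.46Cr
echo normalizeOcrAmount('48 ,969.15Cr') . PHP_EOL;     // 48,969.15Cr
echo undoubleIfFullyDoubled('2211 JJuunn') . PHP_EOL;  // 21 Jun
echo undoubleIfFullyDoubled('11 Jun') . PHP_EOL;       // 11 Jun (unchanged)
```

The doubled-character repair fires only when every token is fully doubled, so a legitimate date such as "11 Jun" is left untouched. Digit misreads such as "929.20" for "529.20" cannot be repaired by pattern matching and still need to be checked against the original PDF.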

4. Testing Specific PDFs

To test a specific PDF file:

php -r "require 'vendor/autoload.php'; \$p = new FnbStatementParser\FnbStatementParser(); \$r = \$p->process('tests/Stubs/filename.pdf'); echo 'Credits: ' . \$r->validation->actualCreditCount . ' (expected: ' . (\$r->validation->expectedCreditCount ?? 'N/A') . ')' . PHP_EOL; echo 'Debits: ' . \$r->validation->actualDebitCount . ' (expected: ' . (\$r->validation->expectedDebitCount ?? 'N/A') . ')' . PHP_EOL;"

5. PDF Processing Pipeline

The parser follows a multi-step processing pipeline:

PDF Decryption (qpdf): Removes security restrictions from PDFs before OCR processing. Some PDFs aren't password-protected but have restrictions (like "modify anything: not allowed" or "print high resolution: not allowed") that ocrmypdf treats as encryption, causing exit code 8. The qpdf --decrypt command removes these restrictions.
OCR Processing (ocrmypdf): Uses maximum quality OCR settings (see Quality Settings section below).
Text Extraction (pdfplumber): Extracts text from the OCR'd PDF, trying simple extraction first and falling back to layout-based extraction when the result is poor (see Quality Settings section below).
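
As a rough illustration of the decryption step, the sketch below assembles a qpdf --decrypt command from PHP. buildQpdfDecryptCommand() and qpdfProducedOutput() are hypothetical names, the output path is a placeholder, and the way the package actually invokes qpdf may differ. The exit-code check assumes qpdf's documented convention that 0 means success and 3 means the output was written with warnings.

```php
<?php
// Hypothetical helper (not part of the package): builds the shell command
// for the decryption step, quoting both paths for the shell.
function buildQpdfDecryptCommand(string $inputPdf, string $outputPdf): string
{
    return sprintf(
        'qpdf --decrypt %s %s',
        escapeshellarg($inputPdf),
        escapeshellarg($outputPdf)
    );
}

// Assumption: qpdf exits with 0 on success and 3 when it wrote the output
// but emitted warnings; any other code is treated as a failure here.
function qpdfProducedOutput(int $exitCode): bool
{
    return $exitCode === 0 || $exitCode === 3;
}

$command = buildQpdfDecryptCommand('/path/to/statement.pdf', '/tmp/decrypted.pdf');
echo $command . PHP_EOL;

// To actually run the step (requires qpdf on the PATH):
// exec($command, $outputLines, $exitCode);
// if (!qpdfProducedOutput($exitCode)) { /* report the failure */ }
```

Writing the decrypted copy to a separate file leaves the original statement untouched, so a failed run can be retried from the same input.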

Quality Settings

This package uses maximum quality settings across all processing stages to ensure accurate extraction from scanned PDFs. All quality flags are documented below:

OCR Processing (ocrmypdf)

Located in src/Processor/PdfProcessor.php, the following maximum quality settings are used:

Image DPI: 400 (--image-dpi 400)
- Higher resolution for better character recognition accuracy
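Tesseract OCR Engine Mode: 3 (--tesseract-oem 3)
- Default engine mode: Tesseract selects the engine based on the installed models, which is the LSTM engine on a standard Tesseract 4+ install (mode 1 would force LSTM only)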
Page Segmentation Mode: 4 (--tesseract-pagesegmode 4)
- Single column mode for consistent formatting
Deskew: Enabled (--deskew)
- Corrects skewed pages for better OCR accuracy
Clean: Enabled (--clean)
- Cleans up artifacts and noise from scanned images
Optimization: 0 (--optimize 0)
- No compression/optimization to preserve maximum quality
PDF/A Image Compression: lossless (--pdfa-image-compression lossless)
- Lossless compression for maximum image quality preservation
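
Put together, the flags above correspond to a single ocrmypdf invocation. The sketch below shows one way to assemble it in PHP. buildOcrmypdfCommand() is a hypothetical helper rather than the package's own code, and only the flags listed above are included.

```php
<?php
// Hypothetical helper (not part of the package): assembles an ocrmypdf
// command line from the quality flags listed above.
function buildOcrmypdfCommand(string $inputPdf, string $outputPdf): string
{
    $flags = [
        '--image-dpi', '400',
        '--tesseract-oem', '3',
        '--tesseract-pagesegmode', '4',
        '--deskew',
        '--clean',
        '--optimize', '0',
        '--pdfa-image-compression', 'lossless',
    ];

    // Quote every part so paths containing spaces survive the shell.
    $parts = array_merge(['ocrmypdf'], $flags, [$inputPdf, $outputPdf]);

    return implode(' ', array_map('escapeshellarg', $parts));
}

$command = buildOcrmypdfCommand('/tmp/decrypted.pdf', '/tmp/ocr.pdf');
echo $command . PHP_EOL;

// To actually run it (requires ocrmypdf on the PATH):
// exec($command, $outputLines, $exitCode);
```

Keeping the flags in one array makes it easy to compare the command against the documented settings when a value changes.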

Text Extraction (pdfplumber)

Located in src/Processor/PdfProcessor.php, the following extraction settings are used:

Layout Extraction Tolerance (X): 5 (x_tolerance=5)
- Horizontal tolerance for layout-based text extraction (fallback mode)
Layout Extraction Tolerance (Y): 1 (y_tolerance=1)
- Vertical tolerance for layout-based text extraction (fallback mode)

Note: The extraction process first attempts simple extraction (which keeps text on the same line) and falls back to layout extraction with these tolerance settings only if simple extraction yields poor results (fewer than 100 characters per page).
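
The fallback rule in the note can be sketched as a small decision function. This is a minimal illustration under stated assumptions: extractPageText() and MIN_CHARS_PER_PAGE are hypothetical names, the two extractors are passed in as callables because the real extraction runs in pdfplumber (Python), and the threshold is applied to a single page's text measured with strlen().

```php
<?php
// Threshold taken from the note above; the constant name is hypothetical.
const MIN_CHARS_PER_PAGE = 100;

/**
 * Returns the simple extraction when it looks usable, otherwise the
 * layout-based fallback. Both extractors are callables so the sketch
 * stays independent of how pdfplumber is actually invoked.
 */
function extractPageText(callable $simpleExtract, callable $layoutExtract): string
{
    $simple = $simpleExtract();
    if (strlen($simple) >= MIN_CHARS_PER_PAGE) {
        return $simple;
    }

    // Fallback: layout extraction (x_tolerance=5, y_tolerance=1 in the package).
    return $layoutExtract();
}

// Stand-in extractors for illustration only.
$text = extractPageText(
    fn (): string => 'too short',
    fn (): string => str_repeat('layout text ', 20)
);

echo strlen($text) . PHP_EOL; // 240, so the layout fallback was used
```

Because the layout extractor is only called when needed, the slower fallback costs nothing on pages where simple extraction already succeeds.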

Quality Philosophy

All settings prioritize accuracy over performance, ensuring reliable extraction from scanned PDFs. The 400 DPI setting, LSTM OCR engine, lossless compression, and high-quality preprocessing ensure maximum character recognition accuracy for financial documents.

Set to maximum:
- Tesseract OCR Engine Mode 3: default engine selection (LSTM on a standard Tesseract 4+ install)
- Optimize 0: no compression (preserves quality)
- PDF/A Image Compression lossless: maximum quality
- Deskew and Clean: maximum preprocessing enabled
- Force OCR: always performs OCR

Appropriately set (not higher, but optimal):
- DPI 400: high quality for OCR. Higher DPI (e.g., 600) offers diminishing returns and much longer processing; 400 DPI is a good balance for financial documents.
- Page Segmentation Mode 4: single-column mode, appropriate for financial statements. Other modes exist but aren't better for this use case.
- pdfplumber tolerances: these are fallback settings. The primary extraction (simple, no layout) is higher quality, so the tolerances are only used if simple extraction fails.

eugenefvdm / fnb-pdf-statement-parser
