PII detection and de-identification SDK for BeNeLux (Belgium, Netherlands, Luxembourg)

github.com/Weichie-com/blur

pkg:composer/weichie-com/blur

v1.1.2 2026-03-04 11:38 UTC

Last update: 2026-04-04 11:48:35 UTC


README

A data protection and de-identification SDK for BeNeLux (Belgium, Netherlands, Luxembourg) and US identifiers, inspired by Microsoft Presidio.

Features

  • Pattern-based PII Detection: Fast and accurate entity recognition using regex patterns
  • Full Validation: Checksum validation (Luhn, mod-97, 11-proof) for high accuracy
  • BeNeLux-Specific Recognizers:
    • 🇳🇱 Dutch BSN (Burgerservicenummer) with 11-proof validation
    • 🇧🇪 Belgian National Number with mod-97 validation
    • 🇱🇺 Luxembourg National ID
    • BeNeLux IBAN codes with mod-97 checksum
    • Phone numbers for BE/NL/LU (using libphonenumber)
  • US-Specific Recognizers:
    • 🇺🇸 Social Security Number (SSN) with area/group/serial validation
    • 🇺🇸 Individual Taxpayer ID (ITIN)
    • 🇺🇸 Passport Number (traditional + next-gen)
    • 🇺🇸 Driver License (multi-state formats)
    • 🇺🇸 Bank Account Number
    • 🇺🇸 ABA Routing Number with checksum validation
  • Generic Recognizers: Email, Credit Card (Luhn), IP Address, URL
  • Multiple Anonymization Strategies:
    • Replace with custom values
    • Redact (remove completely)
    • Mask (partial or full)
    • Hash (SHA-256/SHA-512)
    • Encrypt/Decrypt (AES-256-CBC)
  • Context Enhancement: Boost detection confidence with contextual keywords
  • UTF-8 Support: Full multibyte string handling
  • Type-Safe: Built with PHP 8.1+ strict types

Installation

Install via Composer:

composer require weichie-com/blur

Or add to your composer.json:

{
    "require": {
        "weichie-com/blur": "^1.0"
    }
}

Requirements

  • PHP 8.1+ (for strict types and named parameters)
  • ext-mbstring: Multibyte string support (UTF-8)
  • ext-openssl: AES encryption support
  • giggsey/libphonenumber-for-php: Phone number validation (auto-installed)

Quick Start

<?php

require_once 'vendor/autoload.php';

use Weichie\Blur\Analyzer\AnalyzerEngine;
use Weichie\Blur\Analyzer\RecognizerRegistry;
use Weichie\Blur\Analyzer\Recognizers\BeNeLux\BsnRecognizer;
use Weichie\Blur\Anonymizer\AnonymizerEngine;
use Weichie\Blur\Anonymizer\Models\OperatorConfig;
use Weichie\Blur\Anonymizer\Operators\MaskOperator;

// 1. Setup Analyzer
$registry = new RecognizerRegistry();
$registry->addRecognizer(new BsnRecognizer());

$analyzer = new AnalyzerEngine($registry);

// 2. Analyze text
$text = "Het BSN nummer is 111222333 voor deze klant.";
$results = $analyzer->analyze($text, language: 'nl');

// 3. Setup Anonymizer
$anonymizer = new AnonymizerEngine();
$anonymizer->addOperator(new MaskOperator());

// 4. Anonymize
$operators = [
    'NL_BSN' => OperatorConfig::mask('*', 6)
];

$anonymized = $anonymizer->anonymize($text, $results, $operators);
echo $anonymized->getText();
// Output: "Het BSN nummer is ******333 voor deze klant."

Usage Examples

1. Detecting BeNeLux National IDs

use Weichie\Blur\Analyzer\AnalyzerEngine;
use Weichie\Blur\Analyzer\RecognizerRegistry;
use Weichie\Blur\Analyzer\Recognizers\BeNeLux\BsnRecognizer;
use Weichie\Blur\Analyzer\Recognizers\BeNeLux\BelgianNationalNumberRecognizer;
use Weichie\Blur\Analyzer\Recognizers\BeNeLux\LuxembourgNationalIdRecognizer;

$registry = new RecognizerRegistry();
$registry->addRecognizer(new BsnRecognizer());                      // Dutch BSN
$registry->addRecognizer(new BelgianNationalNumberRecognizer());    // Belgian National Number
$registry->addRecognizer(new LuxembourgNationalIdRecognizer());     // Luxembourg National ID

$analyzer = new AnalyzerEngine($registry);

// 85.07.30-033.28 passes the mod-97 check (97 - (850730033 % 97) = 28).
$text = "BSN: 111222333, BE National: 85.07.30-033.28, LU ID: 1990030112345";
$results = $analyzer->analyze($text, language: 'nl');

foreach ($results as $result) {
    echo "{$result->entityType}: score {$result->score}\n";
}

2. Detecting IBAN Codes

use Weichie\Blur\Analyzer\AnalyzerEngine;
use Weichie\Blur\Analyzer\RecognizerRegistry;
use Weichie\Blur\Analyzer\Recognizers\Generic\IbanRecognizer;

$registry = new RecognizerRegistry();
$registry->addRecognizer(new IbanRecognizer());

$analyzer = new AnalyzerEngine($registry);

$text = "IBAN: NL91ABNA0417164300 (Netherlands), BE68539007547034 (Belgium)";
$results = $analyzer->analyze($text);

3. Multiple Anonymization Strategies

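use Weichie\Blur\Anonymizer\Models\OperatorConfig;

// Each strategy below is an alternative $operators map; pick one per call.
// Strategy 1: Replace with labels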
$operators = [
    'NL_BSN' => OperatorConfig::replace('[BSN-REDACTED]'),
    'EMAIL_ADDRESS' => OperatorConfig::replace('[EMAIL]'),
];

// Strategy 2: Partial masking
$operators = [
    'NL_BSN' => OperatorConfig::mask('*', 6, false),          // Mask first 6 chars
    'CREDIT_CARD' => OperatorConfig::mask('*', 12, false),    // Mask first 12 chars
];

// Strategy 3: Complete redaction
$operators = [
    'DEFAULT' => OperatorConfig::redact(),  // Remove all detected entities
];

// Strategy 4: Hashing for consistency
$operators = [
    'NL_BSN' => OperatorConfig::hash('sha256'),
    'IBAN_CODE' => OperatorConfig::hash('sha256'),
];

// Strategy 5: Encryption (reversible)
$key = 'your-secret-key';
$operators = [
    'NL_BSN' => OperatorConfig::encrypt($key),
    'BE_NATIONAL_NUMBER' => OperatorConfig::encrypt($key),
];
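
To make the reversible strategy more concrete, here is a minimal standalone sketch of AES-256-CBC encryption using PHP's OpenSSL extension. It is not taken from the package's source: the function names encryptValue and decryptValue are hypothetical, and deriving the 32-byte key by hashing a passphrase with SHA-256 is a simplifying assumption (a real deployment would more likely use a random 32-byte key or a proper key-derivation function). CBC mode also provides no integrity check on its own.

```php
<?php
// Hypothetical helpers for illustration; not the package's EncryptOperator.

function encryptValue(string $plaintext, string $passphrase): string
{
    // Assumption: derive a 32-byte key from the passphrase via SHA-256.
    $key = hash('sha256', $passphrase, true);
    // AES-CBC needs a fresh 16-byte IV for every encryption.
    $iv = random_bytes(16);
    $cipher = openssl_encrypt($plaintext, 'aes-256-cbc', $key, OPENSSL_RAW_DATA, $iv);
    if ($cipher === false) {
        throw new RuntimeException('Encryption failed');
    }
    // Prepend the IV so the value can be decrypted later.
    return base64_encode($iv . $cipher);
}

function decryptValue(string $encoded, string $passphrase): string
{
    $raw = base64_decode($encoded, true);
    // 16 bytes of IV plus at least one 16-byte cipher block.
    if ($raw === false || strlen($raw) < 32) {
        throw new InvalidArgumentException('Malformed ciphertext');
    }
    $key = hash('sha256', $passphrase, true);
    $plain = openssl_decrypt(substr($raw, 16), 'aes-256-cbc', $key, OPENSSL_RAW_DATA, substr($raw, 0, 16));
    if ($plain === false) {
        throw new RuntimeException('Decryption failed');
    }
    return $plain;
}
```

Because the IV is random, encrypting the same BSN twice yields different tokens. If stable tokens are needed (for example to join records), hashing is the better fit.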

4. Context Enhancement

Boost detection confidence when context keywords appear near entities:

$text = "Het BSN nummer is 111222333 voor deze klant.";

$results = $analyzer->analyze(
    text: $text,
    language: 'nl',
    context: ['bsn', 'nummer', 'klant'],  // Boost score when these words are nearby
    scoreThreshold: 0.3
);

// The BSN will have a higher confidence score due to context words
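
The idea behind context enhancement can be sketched in a few lines. The following is a standalone illustration, not the package's implementation: the function name boostScoreWithContext, the 40-character window, and the 0.35 boost are all assumptions chosen for the example.

```php
<?php
// Hypothetical sketch of context boosting; window and boost values are illustrative.

function boostScoreWithContext(
    string $text,
    int $start,          // character offset where the entity begins
    int $end,            // character offset just past the entity
    float $score,
    array $contextWords,
    int $window = 40,
    float $boost = 0.35
): float {
    // Collect the text shortly before and after the match.
    $from = max(0, $start - $window);
    $before = mb_substr($text, $from, $start - $from, 'UTF-8');
    $after = mb_substr($text, $end, $window, 'UTF-8');
    $surrounding = mb_strtolower($before . ' ' . $after, 'UTF-8');

    foreach ($contextWords as $word) {
        if ($word === '') {
            continue; // an empty keyword would match everything
        }
        if (mb_strpos($surrounding, mb_strtolower($word, 'UTF-8')) !== false) {
            // One hit is enough; never exceed a score of 1.0.
            return min(1.0, $score + $boost);
        }
    }
    return $score;
}
```

With the sentence above, a base score of 0.5 and the keyword 'bsn' would be raised to 0.85, while the same number in a sentence without any keyword keeps its original score.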

5. Entity Filtering

Detect only specific entity types:

$results = $analyzer->analyze(
    text: $text,
    language: 'nl',
    entities: ['NL_BSN', 'EMAIL_ADDRESS']  // Only detect these types
);

6. Allow List

Exclude specific values from detection:

$results = $analyzer->analyze(
    text: $text,
    language: 'nl',
    allowList: ['test@example.com', '111222333']  // Ignore these values
);

Supported Recognizers

BeNeLux-Specific

Entity Type | Description | Validation | Country
NL_BSN | Dutch Burgerservicenummer | 11-proof checksum | 🇳🇱 NL
BE_NATIONAL_NUMBER | Belgian National Number | mod-97 checksum | 🇧🇪 BE
LU_NATIONAL_ID | Luxembourg National ID | Date validation | 🇱🇺 LU
IBAN_CODE | IBAN (BE/NL/LU) | mod-97 checksum | 🇧🇪 🇳🇱 🇱🇺
PHONE_NUMBER | Phone numbers | libphonenumber | 🇧🇪 🇳🇱 🇱🇺
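
As an illustration of the "Date validation" listed for LU_NATIONAL_ID, the sketch below checks that a 13-digit Luxembourg ID starts with a real calendar date in YYYYMMDD form. It is a standalone example under that assumption about the layout; the function name luxembourgIdHasValidDate is hypothetical and the package's recognizer may apply additional rules.

```php
<?php
// Hypothetical helper; checks only the embedded birth date, nothing else.

function luxembourgIdHasValidDate(string $id): bool
{
    // 13 digits: YYYY MM DD followed by five further digits.
    if (!preg_match('/^(\d{4})(\d{2})(\d{2})\d{5}$/', $id, $m)) {
        return false;
    }
    // checkdate() takes month, day, year and rejects dates such as 30 February.
    return checkdate((int) $m[2], (int) $m[3], (int) $m[1]);
}
```

The ID 1990030112345 used earlier embeds 1 March 1990 and passes, whereas a value starting with 19901331 (month 13) is rejected.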

US-Specific

Entity Type | Description | Validation | Country
US_SSN | Social Security Number | Area/group/serial rules | 🇺🇸 US
US_ITIN | Individual Taxpayer ID | Format + digit ranges | 🇺🇸 US
US_PASSPORT | Passport Number | Pattern (context-boosted) | 🇺🇸 US
US_DRIVER_LICENSE | Driver License | Multi-state patterns | 🇺🇸 US
US_BANK_NUMBER | Bank Account Number | Pattern (context-boosted) | 🇺🇸 US
US_ABA_ROUTING | ABA Routing Number | Weighted checksum (mod 10) | 🇺🇸 US
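
The "area/group/serial rules" for US_SSN have no checksum behind them; they exclude number ranges that are never issued. A minimal standalone sketch, assuming only the well-known exclusions (area 000, 666 and 900-999, group 00, serial 0000), could look like this. The function name ssnIsPlausible is hypothetical and the package may apply further filters.

```php
<?php
// Hypothetical helper; applies only the basic never-issued ranges.

function ssnIsPlausible(string $ssn): bool
{
    // Accept AAA-GG-SSSS with or without dashes.
    if (!preg_match('/^(\d{3})-?(\d{2})-?(\d{4})$/', $ssn, $m)) {
        return false;
    }
    [, $area, $group, $serial] = $m;

    // Area numbers 000, 666 and 900-999 are never assigned.
    if ($area === '000' || $area === '666' || $area[0] === '9') {
        return false;
    }
    // Group 00 and serial 0000 are never assigned.
    return $group !== '00' && $serial !== '0000';
}
```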

Generic

Entity Type | Description | Validation
EMAIL_ADDRESS | Email addresses | RFC validation
CREDIT_CARD | Credit card numbers | Luhn checksum
IP_ADDRESS | IPv4/IPv6 addresses | IP validation
URL | URLs | URL validation

Supported Operators

Operator | Description | Parameters
replace | Replace with custom value | new_value
redact | Remove completely | None
mask | Partial/full masking | masking_char, chars_to_mask, from_end
hash | SHA-256/SHA-512 hashing | algorithm (default: sha256)
encrypt | AES-256-CBC encryption | key
decrypt | AES-256-CBC decryption | key
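
To show how the three mask parameters interact, here is a standalone sketch of multibyte-safe masking. The function name maskValue and the convention that a negative count masks everything are assumptions for the example, not the package's MaskOperator.

```php
<?php
// Hypothetical helper mirroring masking_char, chars_to_mask and from_end.

function maskValue(
    string $value,
    string $maskChar = '*',
    int $charsToMask = -1,   // assumption: negative means "mask everything"
    bool $fromEnd = false
): string {
    // Count characters, not bytes, so accented letters are handled correctly.
    $length = mb_strlen($value, 'UTF-8');
    if ($charsToMask < 0 || $charsToMask > $length) {
        $charsToMask = $length;
    }
    $mask = str_repeat($maskChar, $charsToMask);

    if ($fromEnd) {
        // Keep the leading part, mask the tail.
        return mb_substr($value, 0, $length - $charsToMask, 'UTF-8') . $mask;
    }
    // Mask the leading part, keep the tail.
    return $mask . mb_substr($value, $charsToMask, null, 'UTF-8');
}
```

Called as maskValue('111222333', '*', 6), this reproduces the Quick Start result ******333; with the fourth argument set to true it yields 111****** instead.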

Validation Algorithms

Luhn Checksum (Credit Cards)

Used to validate credit card numbers. Prevents false positives from random digit sequences.
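
A minimal standalone sketch of the Luhn check follows. The function name luhnIsValid and the accepted length range of 12 to 19 digits are assumptions for illustration, not the package's code.

```php
<?php
// Hypothetical helper implementing the Luhn checksum.

function luhnIsValid(string $number): bool
{
    // Strip the spaces and dashes commonly used as separators.
    $digits = preg_replace('/[\s-]+/', '', $number);
    if ($digits === null || !preg_match('/^\d{12,19}$/', $digits)) {
        return false;
    }

    $sum = 0;
    $double = false; // the rightmost digit is not doubled
    for ($i = strlen($digits) - 1; $i >= 0; $i--) {
        $d = (int) $digits[$i];
        if ($double) {
            $d *= 2;
            if ($d > 9) {
                $d -= 9; // same as adding the two digits of the product
            }
        }
        $sum += $d;
        $double = !$double;
    }
    return $sum % 10 === 0;
}
```

The widely used test number 4111 1111 1111 1111 sums to 30 and passes; changing its last digit to 2 gives 31 and fails.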

Mod-97 Checksum (IBAN, Belgian National Number)

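IBAN codes are validated with the ISO 7064 MOD 97-10 algorithm: after moving the first four characters to the end and converting letters to numbers, the result must leave a remainder of 1 when divided by 97. Belgian National Numbers use a related mod-97 check: the last two digits equal 97 minus the remainder of the first nine digits divided by 97 (with a 2 prefixed to those nine digits for people born in 2000 or later).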
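
Both mod-97 checks can be sketched on top of one small remainder helper. This is a standalone illustration; the function names mod97, ibanIsValid and belgianNationalNumberIsValid are hypothetical. For the Belgian number the check digits are 97 minus the remainder of the first nine digits, with a 2 prefixed for births from 2000 onward; since the number itself does not say which century applies, the sketch accepts either.

```php
<?php
// Hypothetical helpers; not taken from the package's recognizers.

function mod97(string $numeric): int
{
    // Work digit by digit so long inputs never overflow PHP integers.
    $remainder = 0;
    foreach (str_split($numeric) as $digit) {
        $remainder = ($remainder * 10 + (int) $digit) % 97;
    }
    return $remainder;
}

function ibanIsValid(string $iban): bool
{
    $iban = strtoupper(preg_replace('/\s+/', '', $iban) ?? '');
    if (!preg_match('/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/', $iban)) {
        return false;
    }
    // Move country code and check digits to the end.
    $rearranged = substr($iban, 4) . substr($iban, 0, 4);
    // Replace letters with numbers: A = 10, B = 11, ..., Z = 35.
    $numeric = preg_replace_callback(
        '/[A-Z]/',
        fn (array $m): string => (string) (ord($m[0]) - 55),
        $rearranged
    );
    return mod97($numeric) === 1;
}

function belgianNationalNumberIsValid(string $value): bool
{
    // Drop dots, dashes and spaces: 85.07.30-033.28 becomes 85073003328.
    $digits = preg_replace('/\D/', '', $value) ?? '';
    if (strlen($digits) !== 11) {
        return false;
    }
    $base = substr($digits, 0, 9);
    $check = (int) substr($digits, 9, 2);

    return 97 - mod97($base) === $check          // born before 2000
        || 97 - mod97('2' . $base) === $check;   // born in 2000 or later
}
```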

11-Proof Checksum (Dutch BSN)

Dutch "elfproef" (11-check) algorithm for validating BSN numbers.
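
The elfproef multiplies the nine digits by the weights 9, 8, 7, 6, 5, 4, 3, 2 and -1 and requires the sum to be divisible by 11. The standalone sketch below combines it with a regex candidate search to show the pattern-then-validate approach; the function names bsnIsValid and findValidBsns are hypothetical.

```php
<?php
// Hypothetical helpers illustrating regex matching followed by checksum validation.

function bsnIsValid(string $bsn): bool
{
    if (!preg_match('/^\d{9}$/', $bsn) || $bsn === '000000000') {
        return false;
    }
    // Note the negative weight on the last digit; this differs from the
    // bank-account variant of the elfproef.
    $weights = [9, 8, 7, 6, 5, 4, 3, 2, -1];
    $sum = 0;
    foreach ($weights as $i => $weight) {
        $sum += $weight * (int) $bsn[$i];
    }
    return $sum % 11 === 0;
}

function findValidBsns(string $text): array
{
    // Step 1: the regex finds every standalone 9-digit run.
    preg_match_all('/\b\d{9}\b/', $text, $matches);
    // Step 2: the checksum discards runs that cannot be a BSN.
    return array_values(array_filter($matches[0], 'bsnIsValid'));
}
```

For 111222333 the weighted sum is 66, which is divisible by 11, so it is accepted. For 123456789 the sum is 147, so it is dropped even though the regex matched it.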

ABA Routing Checksum (US ABA Routing)

Weighted sum mod-10 algorithm (weights: 3, 7, 1) for validating US ABA routing numbers.
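
A minimal standalone sketch of this checksum, with the weights 3, 7, 1 repeated across the nine digits (the function name abaRoutingIsValid is hypothetical):

```php
<?php
// Hypothetical helper implementing the ABA routing number checksum.

function abaRoutingIsValid(string $routing): bool
{
    if (!preg_match('/^\d{9}$/', $routing)) {
        return false;
    }
    $weights = [3, 7, 1, 3, 7, 1, 3, 7, 1];
    $sum = 0;
    foreach ($weights as $i => $weight) {
        $sum += $weight * (int) $routing[$i];
    }
    // Valid routing numbers produce a weighted sum divisible by 10.
    return $sum % 10 === 0;
}
```

For 021000021 the weighted sum is 14 + 1 + 14 + 1 = 30, so the number passes. Changing the last digit to 2 gives 31 and fails.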

Architecture

Weichie\Blur\
├── Analyzer/
│   ├── AnalyzerEngine.php           # Main detection orchestrator
│   ├── EntityRecognizer.php         # Base recognizer interface
│   ├── PatternRecognizer.php        # Pattern-based recognition
│   ├── RecognizerRegistry.php       # Recognizer management
│   ├── Recognizers/
│   │   ├── Generic/                 # Universal recognizers
│   │   │   ├── EmailRecognizer.php
│   │   │   ├── CreditCardRecognizer.php
│   │   │   ├── IpRecognizer.php
│   │   │   ├── UrlRecognizer.php
│   │   │   ├── IbanRecognizer.php
│   │   │   └── PhoneRecognizer.php
│   │   ├── BeNeLux/                 # BeNeLux-specific
│   │   │   ├── BsnRecognizer.php
│   │   │   ├── BelgianNationalNumberRecognizer.php
│   │   │   └── LuxembourgNationalIdRecognizer.php
│   │   └── US/                      # US-specific
│   │       ├── UsSsnRecognizer.php
│   │       ├── UsItinRecognizer.php
│   │       ├── UsPassportRecognizer.php
│   │       ├── UsDriverLicenseRecognizer.php
│   │       ├── UsBankRecognizer.php
│   │       └── AbaRoutingRecognizer.php
│   └── Models/
│       ├── RecognizerResult.php
│       └── Pattern.php
└── Anonymizer/
    ├── AnonymizerEngine.php         # Main anonymization orchestrator
    ├── Operator.php                 # Base operator interface
    ├── TextReplaceBuilder.php       # Text manipulation
    ├── Operators/
    │   ├── ReplaceOperator.php
    │   ├── RedactOperator.php
    │   ├── MaskOperator.php
    │   ├── HashOperator.php
    │   ├── EncryptOperator.php
    │   └── DecryptOperator.php
    └── Models/
        ├── OperatorConfig.php
        ├── OperatorResult.php
        └── EngineResult.php

Design Principles

  1. Simple but Complete: Focus on core functionality without ML/NLP complexity
  2. Pattern-Based: Fast regex matching with validation for accuracy
  3. Type-Safe: PHP 8.1+ with strict types throughout
  4. UTF-8 First: Proper multibyte string handling everywhere
  5. Extensible: Easy to add custom recognizers and operators
  6. Immutable Results: Result objects cannot be modified after creation, so they are safe to share

Performance

  • Fast Pattern Matching: No ML model overhead
  • Efficient Validation: Checksum algorithms run in O(n) time, where n is the number of digits in the candidate
  • UTF-8 Optimized: Uses mb_* functions for correct character offsets
  • Minimal Dependencies: Only essential libraries (libphonenumber)
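
The point about character offsets matters because PHP's preg_* functions report byte offsets, which drift from character positions as soon as the text contains multibyte characters. A standalone sketch of the conversion, with the hypothetical function name findWithCharOffsets:

```php
<?php
// Hypothetical helper; converts PCRE byte offsets into UTF-8 character offsets.

function findWithCharOffsets(string $pattern, string $text): array
{
    $results = [];
    if (preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE) === false) {
        return $results;
    }
    foreach ($matches[0] as [$match, $byteOffset]) {
        // Count the characters in the bytes that precede the match.
        $start = mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
        $results[] = [
            'text'  => $match,
            'start' => $start,
            'end'   => $start + mb_strlen($match, 'UTF-8'),
        ];
    }
    return $results;
}
```

In the string "Café klant: 111222333" the number starts at byte 13 but at character 12, because é occupies two bytes. Using the byte offset with mb_substr would cut the match one character too late.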

Examples

See examples/benelux_example.php for a comprehensive demonstration including:

  • All BeNeLux recognizers in action
  • Multiple anonymization strategies
  • Context enhancement
  • Different operator configurations

Run it:

php examples/benelux_example.php

Contributing

Contributions are welcome! To add support for additional countries:

  1. Create a recognizer in src/Analyzer/Recognizers/CountryName/
  2. Extend PatternRecognizer or implement EntityRecognizer
  3. Add validation logic (checksum, format, etc.)
  4. Include context words in local language(s)
  5. Add tests and examples

License

MIT License - See LICENSE file for details

Credits

This project is inspired by Microsoft Presidio. Special thanks to the Presidio team for their excellent work on PII detection and de-identification.

Roadmap

  • US-specific recognizers (SSN, ITIN, Passport, Driver License, Bank Account, ABA Routing): done, see Supported Recognizers
  • Additional country-specific recognizers (Germany, France, Spain, etc.)
  • Custom recognizer builder API
  • Batch processing support
  • Performance benchmarks
  • Integration with popular PHP frameworks (Laravel, Symfony)

Support

For issues, questions, or contributions, please visit the GitHub repository at github.com/Weichie-com/blur.