magebitcom / pii-redactor
Framework-agnostic PII detection and redaction for text strings
Requires
- php: ^8.1
- ext-mbstring: *
Requires (Dev)
- phpstan/phpstan: ^1.11
- phpunit/phpunit: ^10.5
Last update: 2026-06-10 11:48:19 UTC
README
Framework-agnostic PHP library for detecting and redacting PII in text. No runtime dependencies. PHP 8.1+.
Install
composer require magebitcom/pii-redactor
Quick start
    use Magebit\PiiRedactor\PiiRedactor;

    $redactor = new PiiRedactor();

    echo $redactor->redact('Mail john@example.com, card 4111 1111 1111 1111')->text();
    // Mail [EMAIL], card [CREDIT_CARD]
Detect without redacting
    foreach ($redactor->analyze($text)->matches() as $match) {
        printf(
            "%s [%d, %d) score %.2f via %s\n",
            $match->entityType,
            $match->start,
            $match->end,
            $match->score,
            $match->detectorName
        );
    }
Built-in detectors
EMAIL, PHONE, CREDIT_CARD (Luhn-validated), IBAN (mod-97-validated), IP_ADDRESS (v4/v6), MAC_ADDRESS, URL, CRYPTO_ADDRESS (BTC/ETH), DATE_OF_BIRTH.
Each detector lives in its own class under Detector\Builtin\ (regexes,
validator and sanitize config), wired together by BuiltinDetectors.
Checksum validation is tri-state: a passing checksum forces confidence to 1.0, a failing one discards the match, otherwise the pattern's base score applies.
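The tri-state rule can be illustrated with a small standalone sketch. It is not the library's code: `luhnValid()` and `scoreMatch()` are hypothetical names, and representing the three states as `true`, `false`, and `null` is an assumption made for the example.

```php
<?php
declare(strict_types=1);

/**
 * Luhn checksum over a string of digits (separators already stripped).
 */
function luhnValid(string $digits): bool
{
    if ($digits === '' || !ctype_digit($digits)) {
        return false;
    }
    $sum = 0;
    $double = false;
    // Walk right to left, doubling every second digit.
    for ($i = strlen($digits) - 1; $i >= 0; $i--) {
        $d = (int) $digits[$i];
        if ($double) {
            $d *= 2;
            if ($d > 9) {
                $d -= 9;
            }
        }
        $sum += $d;
        $double = !$double;
    }
    return $sum % 10 === 0;
}

/**
 * Hypothetical tri-state scoring: true = checksum passed, false = checksum
 * failed, null = the pattern has no checksum. A null result means "discard".
 */
function scoreMatch(float $baseScore, ?bool $checksum): ?float
{
    return match ($checksum) {
        true => 1.0,        // validated: full confidence
        false => null,      // invalid: drop the match
        null => $baseScore, // nothing to validate: keep the base score
    };
}

$card = preg_replace('/\D/', '', '4111 1111 1111 1111') ?? '';

$passing = scoreMatch(0.6, luhnValid($card));               // 1.0
$failing = scoreMatch(0.6, luhnValid('4111111111111112'));  // null (discarded)
$noCheck = scoreMatch(0.3, null);                           // 0.3
```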
Context boosting
Weak patterns (a bare 8-digit phone number, an ambiguous date) carry a low base
score on purpose. They only cross the reporting threshold (default 0.5) when a
context word appears within ~40 characters before the match — +0.35 to the
score, shown in the match's explanation (e.g. context "phone" (+0.35)).
This is the library's primary false-positive control: without it, weak patterns
either flood with noise at a high base score or stay undetectable.
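As a rough illustration of the mechanism described above, the sketch below looks for a context word in the text just before a match and adds a fixed boost. `boostedScore()` is a hypothetical helper, not part of the library, and it measures the window in bytes for simplicity.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: add a fixed boost when any context word occurs in the
 * window of text immediately before the match.
 *
 * @param string[] $contextWords
 */
function boostedScore(
    string $text,
    int $matchStart,
    float $baseScore,
    array $contextWords,
    int $window = 40,
    float $boost = 0.35,
): float {
    // Slice the window preceding the match (byte-based, an approximation).
    $from = max(0, $matchStart - $window);
    $before = mb_strtolower(substr($text, $from, $matchStart - $from));

    foreach ($contextWords as $word) {
        if (str_contains($before, mb_strtolower($word))) {
            return min(1.0, $baseScore + $boost);
        }
    }
    return $baseScore;
}

$threshold = 0.5;
$words = ['phone', 'tel'];

$text = 'Call me, phone 26123456';
$with = boostedScore($text, (int) strpos($text, '26123456'), 0.3, $words);

$other = 'ref 26123456';
$without = boostedScore($other, (int) strpos($other, '26123456'), 0.3, $words);

// $with is about 0.65 and crosses the threshold; $without stays at 0.3.
```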
The library ships a single English (en) context-word pack. To recognise terms
in other languages, or words specific to your domain, extend the default pack
with your own words — no fork required. ContextWords is immutable; withWords()
merges onto an entity type's existing list (de-duplicated) and returns a copy:
    use Magebit\PiiRedactor\Context\ContextWords;
    use Magebit\PiiRedactor\Detector\DetectorRegistry;
    use Magebit\PiiRedactor\EntityType;
    use Magebit\PiiRedactor\PiiRedactor;

    // English context words only (default)
    $redactor = new PiiRedactor();

    // Extend an existing entity type (e.g. Latvian phone words) and add a custom one
    $words = ContextWords::default()
        ->withWords(EntityType::PHONE, ['tālrunis', 'telefons']) // built-in type
        ->withWords('EMPLOYEE_ID', ['employee', 'staff']);       // custom type (string)

    $redactor = new PiiRedactor(DetectorRegistry::withDefaults($words));
Pass ContextWords::none() for a pack with no context words at all. For full
control over a single detector's words, Builtin\PhoneDetector::create($words)
still accepts a plain string[].
Opting out of boosting. To score matches purely on their patterns, inject the no-op enhancer (or build detectors without context words):
    use Magebit\PiiRedactor\Analyzer;
    use Magebit\PiiRedactor\NullContextEnhancer;
    use Magebit\PiiRedactor\Detector\DetectorRegistry;

    $analyzer = new Analyzer(DetectorRegistry::withDefaults(), new NullContextEnhancer());
With boosting disabled, weak detectors (bare phone/date forms) stay below the default threshold and are not reported; strong detectors (email, validated card or IBAN) are unaffected.
Per-entity strategies
    use Magebit\PiiRedactor\PiiRedactor;
    use Magebit\PiiRedactor\RedactionConfig;
    use Magebit\PiiRedactor\Strategy\MaskStrategy;
    use Magebit\PiiRedactor\Strategy\ReplaceStrategy;

    $config = new RedactionConfig(
        ['CREDIT_CARD' => new MaskStrategy('*', 4, true)], // **** -> ************1111
        new ReplaceStrategy('<REDACTED:{type}>'),          // default for everything else
    );

    $redactor = new PiiRedactor(config: $config);
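For intuition about what a keep-the-last-four mask produces, here is a standalone sketch. `maskKeepLast()` is a hypothetical helper and does not reflect how MaskStrategy is implemented. Stripping separators before masking is an assumption, made only because the sample output in the comment above is 16 characters long.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical masking helper: replace every character except the last
 * $keep with $char. Whitespace and hyphens are stripped first (assumption).
 */
function maskKeepLast(string $value, string $char = '*', int $keep = 4): string
{
    $compact = preg_replace('/[\s-]+/', '', $value) ?? $value;
    $length = mb_strlen($compact);

    // Values no longer than $keep are masked entirely rather than revealed.
    if ($length <= $keep) {
        return str_repeat($char, $length);
    }
    return str_repeat($char, $length - $keep) . mb_substr($compact, -$keep);
}

$masked = maskKeepLast('4111 1111 1111 1111'); // ************1111
```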
Options
    use Magebit\PiiRedactor\AnalyzerOptions;

    $options = new AnalyzerOptions(
        entityTypes: ['EMAIL', 'CREDIT_CARD'], // restrict types
        minScore: 0.6,                         // raise threshold
        allowList: ['support@magebit.com'],    // never redact these literals
    );
Custom detectors
    use Magebit\PiiRedactor\Detector\DetectorRegistry;
    use Magebit\PiiRedactor\Detector\Pattern;
    use Magebit\PiiRedactor\Detector\PatternDetector;
    use Magebit\PiiRedactor\PiiRedactor;

    $registry = DetectorRegistry::withDefaults();
    $registry->register(new PatternDetector('employee-id', 'EMPLOYEE_ID', [
        new Pattern('emp', '/\bEMP-\d{6}\b/u', 0.9, ['EMP-']),
    ]));

    $redactor = new PiiRedactor($registry);
The optional fourth Pattern argument is a list of requiredNeedles: cheap
literal substrings the engine checks with str_contains before running the
regex. If none are present, the regex is skipped entirely. Every real match must
therefore contain at least one of the needles (here, EMP- is part of the pattern), or
matches will be lost. The built-in EMAIL/URL/CRYPTO/PHONE detectors use this to
skip their regexes on the (common) lines that contain no @, ://, 0x, etc.
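The needle pre-filter is easy to picture in isolation. The sketch below is a standalone approximation, not the engine's code: `findWithNeedles()` is a hypothetical function that runs the regex only when at least one needle is present.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical pre-filter: run the regex only if at least one literal needle
 * occurs in the text. An empty needle list means "always run the regex".
 *
 * @param string[] $needles
 * @return string[] matched substrings
 */
function findWithNeedles(string $text, string $regex, array $needles = []): array
{
    if ($needles !== []) {
        $present = false;
        foreach ($needles as $needle) {
            if (str_contains($text, $needle)) {
                $present = true;
                break;
            }
        }
        if (!$present) {
            return []; // cheap exit: the regex never runs
        }
    }

    preg_match_all($regex, $text, $m);
    return $m[0];
}

$pattern = '/\bEMP-\d{6}\b/u';

$hits = findWithNeedles('badge EMP-123456 issued', $pattern, ['EMP-']);
// ['EMP-123456']

$none = findWithNeedles('GET /health 200', $pattern, ['EMP-']);
// [] and the regex was skipped
```

A needle that real matches do not always contain (for example `emp-` in lowercase here) would silently drop valid matches, which is why the needle has to come from the literal part of the pattern.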
For ML-grade name/location detection, extend Detector\RemoteDetector and
call your NER provider (Google DLP, AWS Comprehend, a Presidio sidecar);
it handles chunking and fail-open/fail-closed behavior.
Performance & logging
Logging is often on a blocking hot path, so per-call latency matters. Construct
PiiRedactor once and reuse it — do not build a new instance per log record.
Construction wires up 9 detectors and their context-word packs, which costs about
as much as a full short-line redaction; reusing the instance roughly halves
per-call latency.
Measured on PHP 8.4 (Xdebug off), reusing a single instance:
| Scenario | µs/op |
|---|---|
| Clean short log line (no PII) | ~5.3 |
| Short line with one email | ~5.9 |
| new PiiRedactor() per call + redact | ~11.0 |
So a naive (new PiiRedactor())->redact($line) inside a Monolog processor is
~2x slower than holding one instance on the processor. For a Monolog processor,
build the redactor in the constructor and call redact() in __invoke().
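The shape of that recommendation, stripped of framework details, might look like the sketch below. It is deliberately framework-free so it runs on its own: `RedactingProcessor` is a hypothetical class, records are plain arrays, and a closure with a toy email regex stands in for the PiiRedactor instance that a real processor would hold.

```php
<?php
declare(strict_types=1);

/**
 * Framework-free stand-in for a log processor. The expensive object is
 * injected once through the constructor; __invoke() only uses it.
 */
final class RedactingProcessor
{
    public function __construct(private \Closure $redact)
    {
    }

    /**
     * @param array{message: string} $record
     * @return array{message: string}
     */
    public function __invoke(array $record): array
    {
        $record['message'] = ($this->redact)($record['message']);
        return $record;
    }
}

// Stand-in redactor (assumption): built once, outside the per-record path.
$redact = static fn (string $line): string
    => preg_replace('/[\w.+-]+@[\w-]+(\.[\w-]+)+/u', '[EMAIL]', $line) ?? $line;

$processor = new RedactingProcessor($redact);

$first = $processor(['message' => 'login john@example.com'])['message'];
$second = $processor(['message' => 'GET /health 200'])['message'];
// $first is "login [EMAIL]"; $second is unchanged.
```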
A throwaway benchmark harness lives at tools/benchmark.php
(php -dxdebug.mode=off tools/benchmark.php).
Guarantees & limits
- Byte offsets, UTF-8 safe; all bundled regexes use the u modifier.
- A failing detector never breaks redaction (reported in failures(), strict mode available).
- No regex-based detector can promise 100% recall; treat this as defense in depth, not a compliance guarantee.
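Because offsets are in bytes, they pair with PHP's byte-based string functions rather than the `mb_*` family. The standalone sketch below uses plain `preg_match` with a toy email regex (not the library's detector) to show the difference on a string containing a multi-byte character.

```php
<?php
declare(strict_types=1);

$text = 'Jānis: john@example.com';

// PREG_OFFSET_CAPTURE reports byte offsets, even with the u modifier.
preg_match('/[\w.+-]+@[\w.-]+/u', $text, $m, PREG_OFFSET_CAPTURE);
[$value, $start] = $m[0];
$end = $start + strlen($value);

// Byte-based functions line up with those offsets.
$redacted = substr_replace($text, '[EMAIL]', $start, $end - $start);

// A character-based slice at the same numbers is shifted by one,
// because "ā" occupies two bytes in UTF-8.
$shifted = mb_substr($text, $start, $end - $start);
```

Here `$redacted` is `Jānis: [EMAIL]`, while `$shifted` starts one character late (`ohn@example.com`).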