wp-php-toolkit / encoding
Encoding component for WordPress.
Requires
- php: >=7.2
This package is auto-updated.
Last update: 2026-05-04 13:40:27 UTC
README
| slug | encoding |
|---|---|
| title | Encoding |
| install | wp-php-toolkit/encoding |
| see_also | |
UTF-8 validation and scrubbing with a pure-PHP fallback when mbstring is unavailable. Detects malformed bytes and replaces them per the Unicode maximal-subpart algorithm.
Why this exists
Every parser in this toolkit eventually has to decide what to do with text bytes. XML rejects malformed UTF-8. JSON and databases can fail late. CSS, HTML, WXR, and Blueprint validation all need consistent answers about whether a string is well-formed Unicode.
The Encoding component provides the small UTF-8 primitives the rest of the toolkit can share: validate bytes, scrub invalid sequences, scan code points, and detect Unicode noncharacters. When mbstring is available it can delegate to it; when it is not, the component uses its own byte scanner so behavior stays available in restricted PHP environments.
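The component's code-point scanner itself is not shown in this README, but the idea is easy to sketch in pure PHP. `utf8_code_points()` below is a hypothetical helper, not part of the package API; it assumes its input has already passed validation (guard with wp_is_valid_utf8() first).

```php
<?php
// Hypothetical sketch of a code-point scan over already-valid UTF-8:
// split into characters with PCRE, then decode each character's bytes
// by hand. Not the component's implementation.
function utf8_code_points( string $text ): array {
	$points = array();
	foreach ( preg_split( '//u', $text, -1, PREG_SPLIT_NO_EMPTY ) as $char ) {
		$b = array_values( unpack( 'C*', $char ) );
		if ( $b[0] < 0x80 ) {            // 1-byte sequence: ASCII.
			$points[] = $b[0];
		} elseif ( $b[0] < 0xE0 ) {      // 2-byte sequence.
			$points[] = ( ( $b[0] & 0x1F ) << 6 ) | ( $b[1] & 0x3F );
		} elseif ( $b[0] < 0xF0 ) {      // 3-byte sequence.
			$points[] = ( ( $b[0] & 0x0F ) << 12 ) | ( ( $b[1] & 0x3F ) << 6 ) | ( $b[2] & 0x3F );
		} else {                         // 4-byte sequence.
			$points[] = ( ( $b[0] & 0x07 ) << 18 ) | ( ( $b[1] & 0x3F ) << 12 )
				| ( ( $b[2] & 0x3F ) << 6 ) | ( $b[3] & 0x3F );
		}
	}
	return $points;
}

foreach ( utf8_code_points( "A\u{00E9}\u{270F}" ) as $cp ) {
	echo sprintf( 'U+%04X ', $cp );
}
echo "\n"; // U+0041 U+00E9 U+270F
```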
Historically, this became the common foundation for Blueprint validation and CSS/XML processing, replacing ad hoc Unicode helpers with the WordPress core UTF-8 routines used here.
Validating UTF-8 before storing it
wp_is_valid_utf8() rejects overlong sequences, surrogate halves, and stray ISO-8859-1 bytes. Use it as a guard in front of any code path that assumes UTF-8 (database, JSON, XML).
```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_is_valid_utf8;

$samples = array(
	'ASCII'          => 'just a test',
	'UTF-8 pencil'   => "\xE2\x9C\x8F",
	'latin-1 byte'   => "B\xFCch",
	'overlong slash' => "\xC1\xBF",
	'surrogate half' => "\xED\xB0\x80",
);

foreach ( $samples as $label => $bytes ) {
	echo sprintf(
		"%-14s %s\n",
		$label . ':',
		wp_is_valid_utf8( $bytes ) ? 'valid' : 'invalid'
	);
}
```

```
ASCII:         valid
UTF-8 pencil:  valid
latin-1 byte:  invalid
overlong slash: invalid
surrogate half: invalid
```
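For comparison only: outside the toolkit, PCRE gives a rough baseline, because matching an empty pattern with the /u modifier fails (returns false) when the subject is not well-formed UTF-8. Edge-case behavior and error reporting may differ from wp_is_valid_utf8(), so treat this as a sketch, not a substitute.

```php
<?php
// Baseline UTF-8 validity check via PCRE: an empty pattern with /u
// returns 1 on well-formed subjects and false on ill-formed ones.
// Sketch for comparison with wp_is_valid_utf8(), not a replacement.
function pcre_is_valid_utf8( string $bytes ): bool {
	return 1 === preg_match( '//u', $bytes );
}

var_dump( pcre_is_valid_utf8( "\xE2\x9C\x8F" ) ); // bool(true)  - UTF-8 pencil
var_dump( pcre_is_valid_utf8( "B\xFCch" ) );      // bool(false) - stray latin-1 byte
var_dump( pcre_is_valid_utf8( "\xC1\xBF" ) );     // bool(false) - overlong slash
```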
Scrubbing invalid bytes with U+FFFD
Replace each ill-formed sequence with the Unicode replacement character. Useful right before serializing to XML, JSON, or sending to an LLM that will choke on broken bytes.
```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_scrub_utf8;

$broken = "the byte \xC0 should not be here.";
echo wp_scrub_utf8( $broken ) . "\n";
echo wp_scrub_utf8( ".\xE2\x8C\xE2\x8C." ) . "\n";
```

```
the byte � should not be here.
.��.
```
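When mbstring is available, a comparable scrub can be approximated with standard mbstring calls. This is a sketch, not the component's implementation, and its replacement granularity is not guaranteed to follow the maximal-subpart algorithm that wp_scrub_utf8() implements.

```php
<?php
// Approximate scrub using mbstring: converting UTF-8 to UTF-8 replaces
// ill-formed sequences with the substitute character. The exact number
// of U+FFFD emitted per broken sequence may differ from the Unicode
// maximal-subpart algorithm used by wp_scrub_utf8().
mb_substitute_character( 0xFFFD );
$scrubbed = mb_convert_encoding( "the byte \xC0 should not be here.", 'UTF-8', 'UTF-8' );
echo $scrubbed . "\n";
```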
Detecting noncharacters MySQL/utf8mb4 will reject
Code points like U+FFFE, U+FFFF, and the U+FDD0–U+FDEF block are valid Unicode but forbidden in XML and rejected by some databases. Check before inserting user-submitted content into a strict utf8mb4 column.
```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_has_noncharacters;

$samples = array(
	'normal text' => 'normal text',
	'U+FFFE'      => "oops \u{FFFE}",
	'U+FDD0'      => "hi \u{FDD0} bye",
);

foreach ( $samples as $label => $text ) {
	echo sprintf(
		"%-12s %s\n",
		$label . ':',
		wp_has_noncharacters( $text ) ? 'reject' : 'ok'
	);
}
```

```
normal text: ok
U+FFFE:      reject
U+FDD0:      reject
```
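For reference, the full noncharacter set is U+FDD0–U+FDEF plus the last two code points of every plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, up through U+10FFFE/U+10FFFF), which is simple to check by hand. `has_noncharacters()` below is a hypothetical re-implementation that assumes mbstring is present; it is not the package function.

```php
<?php
// Hypothetical noncharacter check, assuming mbstring for code-point
// access. A code point is a noncharacter when it falls in
// U+FDD0..U+FDEF or its low 16 bits are FFFE or FFFF (last two code
// points of each plane). Not the component's implementation.
function has_noncharacters( string $text ): bool {
	foreach ( preg_split( '//u', $text, -1, PREG_SPLIT_NO_EMPTY ) as $char ) {
		$cp = mb_ord( $char, 'UTF-8' );
		if ( ( $cp >= 0xFDD0 && $cp <= 0xFDEF ) || 0xFFFE === ( $cp & 0xFFFE ) ) {
			return true;
		}
	}
	return false;
}

var_dump( has_noncharacters( 'normal text' ) );   // bool(false)
var_dump( has_noncharacters( "oops \u{FFFE}" ) ); // bool(true)
```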
Three-way pipeline: validate, scrub, then check noncharacters
Real-world inputs are messy: an old WXR export, a CSV with mixed encodings, a paste from Word. Combining validation, scrubbing, and a noncharacter check covers the three classes of breakage that bite later.
```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_is_valid_utf8;
use function WordPress\Encoding\wp_scrub_utf8;
use function WordPress\Encoding\wp_has_noncharacters;

$inputs = array(
	'good'      => 'Café',
	'latin1'    => "caf\xE9",
	'overlong'  => "x\xC1\xBFy",
	'noncharac' => "hi \u{FFFE} there",
);

foreach ( $inputs as $label => $bytes ) {
	$valid   = wp_is_valid_utf8( $bytes );
	$cleaned = wp_scrub_utf8( $bytes );
	$weird   = wp_has_noncharacters( $cleaned );
	echo sprintf(
		"%-10s valid=%s noncharacter=%s -> %s\n",
		$label,
		$valid ? 'Y' : 'N',
		$weird ? 'Y' : 'N',
		$cleaned
	);
}
```

```
good       valid=Y noncharacter=N -> Café
latin1     valid=N noncharacter=N -> caf�
overlong   valid=N noncharacter=N -> x��y
noncharac  valid=Y noncharacter=Y -> hi � there
```
Salvaging a legacy ISO-8859-1 column inside a UTF-8 corpus
Old WordPress databases sometimes mix encodings: most rows are UTF-8 but a few were stored as latin-1. Detect the bad rows with wp_is_valid_utf8() and only re-encode those.
```php
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_is_valid_utf8;
use function WordPress\Encoding\wp_scrub_utf8;

$rows = array(
	1 => 'Plain ASCII',
	2 => 'Café',
	3 => "caf\xE9",
	4 => "weird \xC0 byte",
);

foreach ( $rows as $id => $value ) {
	if ( wp_is_valid_utf8( $value ) ) {
		echo "#$id ok: $value\n";
		continue;
	}
	$converted = @iconv( 'ISO-8859-1', 'UTF-8', $value );
	if ( false !== $converted && wp_is_valid_utf8( $converted ) ) {
		echo "#$id recovered as latin1: $converted\n";
	} else {
		echo "#$id unrecoverable, scrubbing: " . wp_scrub_utf8( $value ) . "\n";
	}
}
```

```
#1 ok: Plain ASCII
#2 ok: Café
#3 recovered as latin1: café
#4 recovered as latin1: weird À byte
```
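One caveat: data labeled latin-1 often turns out to be Windows-1252, where bytes 0x80–0x9F are smart quotes and dashes rather than C1 control characters. If iconv on your platform accepts the 'Windows-1252' encoding name (availability varies by iconv implementation), trying it before ISO-8859-1 recovers that punctuation.

```php
<?php
// Sketch (assumption): the 'Windows-1252' encoding name must be
// supported by the local iconv. Bytes \x93 and \x94 are curly quotes
// in CP1252 but C1 control characters in strict ISO-8859-1.
$legacy = "he said \x93hi\x94"; // CP1252 curly quotes
echo iconv( 'Windows-1252', 'UTF-8', $legacy ) . "\n"; // he said “hi”
```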