Deterministic UK postcode analysis and correction using constraint-based grammar rules.

Package info

github.com/jamiethompson/cikmov-php

pkg:composer/jamiethompson/cikmov

v0.1.1 2026-02-22 23:45 UTC

Last update: 2026-03-23 00:04:35 UTC


README

Deterministic UK postcode analysis and correction for format-level validation.

Why This Library Exists

UK postcodes are not free-form text. They are a constrained grammar designed for human use and machine sorting.

The system was designed to:

  • support mechanical and automated sorting
  • reduce transcription errors
  • avoid visually ambiguous character patterns
  • encode geography hierarchically
  • remain human-readable

cikmov models these constraints directly in code and uses deterministic candidate generation plus rule filtering instead of fuzzy matching.

Constraint-Based, Not Fuzzy

This library intentionally does not use probabilistic matching, distance metrics, or fuzzy heuristics.

Why:

  • postcode structure is finite and strongly constrained
  • invalid candidates can be eliminated deterministically
  • behaviour stays explainable, reproducible, and testable
  • correction risk is lower when every decision is rule-backed

Shifted Number-Row Digit Support

cikmov supports deterministic correction when shifted number-row symbols are typed instead of digits.

Supported substitutions:

! -> 1
@ -> 2
" -> 2
# -> 3
£ -> 3
$ -> 4
% -> 5
^ -> 6
& -> 7
* -> 8
( -> 9
) -> 0

Scope rules:

  • the mapping is the union of the UK and US shifted number-row symbols (the UK layout also covers Irish usage)
  • no keyboard-layout detection is performed at runtime
  • substitutions are attempted only where grammar requires digits:
    • outward digit positions
    • district digit positions
    • inward first character
  • substitutions are not attempted in letter-only positions
  • when stripping shifted symbols produces an already-valid compact postcode, symbols are treated as noise and not as digit substitutions

Public API

<?php

use Cikmov\Cikmov;

$result = Cikmov::analyse('ec1a ial');

Single public entrypoint:

Cikmov::analyse(string $input, int $minConfidenceToApply = 85): Result;

No configuration object is exposed in v1.

Result Object

Result is a final immutable value object using public readonly properties.

Fields:

  • input: original raw input
  • normalizedInput: display-normalized uppercase form used during analysis (not guaranteed canonical-valid)
  • inputWasValid: whether normalized input was already structurally valid
  • bestCandidate: highest scoring canonical candidate, if any
  • confidence: numeric confidence (0-100)
  • appliedPostcode: applied correction when confidence meets threshold
  • alternatives: other high-ranked canonical candidates for ambiguity reporting

Defensive invariants are enforced in the constructor (confidence bounds, canonical formatting, uniqueness rules, consistency between flags and values).
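
To show the general shape of such a value object, here is a reduced sketch. ExampleResult is a hypothetical stand-in with three of the fields; the second invariant is an assumed consistency rule for illustration, and the real Result class checks more than this. It requires PHP 8.1 or later for readonly properties.

```php
<?php

// Hypothetical stand-in for the library's Result class, reduced to three
// fields. The real constructor enforces more invariants than shown here.
final class ExampleResult
{
    public function __construct(
        public readonly string $input,
        public readonly int $confidence,
        public readonly ?string $appliedPostcode,
    ) {
        // Confidence bounds.
        if ($confidence < 0 || $confidence > 100) {
            throw new InvalidArgumentException('confidence must be within 0-100');
        }
        // Assumed consistency rule: nothing can be applied at zero confidence.
        if ($confidence === 0 && $appliedPostcode !== null) {
            throw new InvalidArgumentException('appliedPostcode requires a non-zero confidence');
        }
    }
}
```

Because every property is readonly, any attempt to reassign a field after construction raises an Error, which is what makes the object immutable.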

Grammar Rules Enforced

A postcode is treated as:

[outward] [inward]

Inward unit

Pattern:

digit letter letter

Rules:

  • first inward character must be 0-9
  • last two inward characters must be A-Z
  • last two inward characters must not contain C I K M O V

The CIKMOV exclusion exists because these six letters are easily misread as digits or as each other (for example I as 1, or O as 0), which makes them error-prone in the inward unit.
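
The inward rules fit in a single regular expression. The sketch below is one way to express them; isValidInward is a hypothetical name, not a function the library exposes.

```php
<?php

// Sketch of the inward-unit rule: one digit followed by two letters drawn
// from A-Z with C, I, K, M, O and V removed.
function isValidInward(string $inward): bool
{
    // The D modifier stops "$" from matching before a trailing newline.
    return preg_match('/^[0-9][ABD-HJLNP-UW-Z]{2}$/D', $inward) === 1;
}
```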

Outward formats

Allowed structural forms:

A9
A9A
A99
AA9
AA9A
AA99

Additional rules:

  • first outward letter cannot be Q V X
  • second outward letter (when present) cannot be I J Z
  • first outward digit is constrained to 1-9 (no leading zero district)
  • area prefix must exist in the embedded official area list
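
The six structural forms and the letter and digit exclusions can be sketched as one pattern. This is a shape check only, under a hypothetical function name: it covers neither the area-list lookup nor the AA9A restrictions described in the next section.

```php
<?php

// Structural check for the outward code:
//   first letter:  A-Z without Q, V, X
//   second letter: optional, A-Z without I, J, Z
//   first digit:   1-9
//   final slot:    optional second digit or trailing letter
function hasOutwardShape(string $outward): bool
{
    return preg_match('/^[A-PR-UWYZ][A-HK-Y]?[1-9](?:[0-9]|[A-Z])?$/D', $outward) === 1;
}
```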

AA9A special restrictions

AA9A is geographically constrained, not globally available.

Rules:

  • fourth outward character must be one of: A B E H M N P R V W X Y
  • allowed area/district combinations:
    • EC with district 1-4
    • SW with district 1
    • WC with district 1-2
    • NW only NW1W
    • SE only SE1P

Why restricted:

  • this pattern reflects specific London district conventions rather than a general pattern.
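
A minimal sketch of the AA9A restrictions as a lookup follows. The constants and the function name isAllowedAa9a are illustrative assumptions; the library's internal representation may differ.

```php
<?php

// Allowed districts per area for the AA9A form, taken from the list above.
const AA9A_DISTRICTS = [
    'EC' => [1, 2, 3, 4],
    'SW' => [1],
    'WC' => [1, 2],
];

// Areas where only one complete outward code is allowed.
const AA9A_EXACT = ['NW1W', 'SE1P'];

function isAllowedAa9a(string $outward): bool
{
    // Must have the AA9A shape with a permitted fourth character.
    if (preg_match('/^([A-Z]{2})([1-9])([ABEHMNPRVWXY])$/D', $outward, $m) !== 1) {
        return false;
    }
    if (in_array($outward, AA9A_EXACT, true)) {
        return true;
    }
    // Otherwise the area must be listed and the district must be permitted.
    return in_array((int) $m[2], AA9A_DISTRICTS[$m[1]] ?? [], true);
}
```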

Special recognised code

  • GIR 0AA is explicitly recognised as valid format.

Non-geographic area prefixes

Included as valid area prefixes:

  • BF
  • BX

Area Prefix Enforcement

Area prefix validation is mandatory in v1. There is no bypass flag.

Why:

  • format validity should reflect real structural postcode grammar
  • optional bypass weakens deterministic correctness and increases false positives

Deterministic Scoring Model

Candidate generation is positional:

  1. normalize input (uppercase, remove non-alphanumeric separators/noise)
  2. generate substitutions only where character class mismatches occur (digit/letter confusion maps)
  3. prune candidates that violate grammar constraints
  4. score surviving candidates numerically (0-100)
  5. select highest score deterministically (score desc, lexical tiebreak)
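
Step 5 can be sketched as a plain sort. The function below is a hypothetical illustration of "score desc, lexical tiebreak", assuming the tiebreak is ascending byte-wise string order; it is not the library's internal code.

```php
<?php

/**
 * Order candidates best first: higher score wins, and equal scores fall
 * back to ascending string order so the result never depends on input order.
 *
 * @param array<string,int> $scores candidate postcode => score (0-100)
 * @return string[]
 */
function rankCandidates(array $scores): array
{
    $candidates = array_keys($scores);

    usort($candidates, static function (string $a, string $b) use ($scores): int {
        return ($scores[$b] <=> $scores[$a]) ?: strcmp($a, $b);
    });

    return $candidates;
}
```

Because the comparison is total over distinct candidates, the same input always yields the same ordering, which is what keeps selection reproducible.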

Scoring policy:

  • edits in outward positions are penalized more than inward positions
  • this reflects higher structural significance of outward geography encoding
  • ambiguity lowers confidence further
  • alternatives are capped at 5 entries for bounded output size
  • shifted number-row symbol penalties:
    • inward digit substitution: -8
    • outward non-area digit substitution: -14
    • outward area digit substitution: -22 (reserved for completeness; current grammar does not place digits in outward area-letter slots)
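
The shifted-symbol penalties can be illustrated with a small subtractive model. This sketch assumes scoring starts at 100 and subtracts one penalty per edit, which agrees with the confidence of 92 shown for a single inward substitution in the examples below, but it is an assumption rather than the documented algorithm. Ambiguity reductions and other edit types are not modelled, and the names are hypothetical.

```php
<?php

// Penalties for shifted number-row substitutions, as listed above.
const SHIFTED_PENALTIES = [
    'inward_digit'           => 8,
    'outward_non_area_digit' => 14,
    'outward_area_digit'     => 22,
];

/**
 * Assumed model: start at 100, subtract one penalty per edit, and clamp
 * the result to the 0-100 range.
 *
 * @param string[] $edits keys of SHIFTED_PENALTIES
 */
function scoreShiftedEdits(array $edits): int
{
    $score = 100;

    foreach ($edits as $edit) {
        if (!array_key_exists($edit, SHIFTED_PENALTIES)) {
            throw new InvalidArgumentException("unknown edit type: $edit");
        }
        $score -= SHIFTED_PENALTIES[$edit];
    }

    return max(0, min(100, $score));
}
```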

Ambiguity application policy:

  • ambiguity reduces confidence deterministically
  • correction still applies when reduced confidence remains above the threshold
  • top score ties are not automatically rejected; threshold gating remains the apply gate

Why outward edits are penalized more:

  • outward errors are more likely to alter geographic interpretation
  • the inward unit encodes finer routing granularity, and its fixed digit-letter-letter pattern leaves fewer plausible corrections to choose between

Threshold Policy

Correction is applied only when:

confidence >= minConfidenceToApply

Default threshold is 85.

Recommended guidance:

  • 90-95: conservative, lower false positives
  • 85: balanced default
  • 70-80: aggressive correction, more candidates accepted
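
The apply gate itself is a single comparison. The helper below is a hypothetical sketch of that rule, reusing the documented parameter name and default.

```php
<?php

// Returns the postcode to apply, or null when there is no candidate or the
// confidence does not reach the threshold.
function applyIfConfident(?string $bestCandidate, int $confidence, int $minConfidenceToApply = 85): ?string
{
    if ($bestCandidate === null) {
        return null;
    }

    return $confidence >= $minConfidenceToApply ? $bestCandidate : null;
}
```

A confidence exactly equal to the threshold is applied, because the comparison is inclusive.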

Format Validity vs Existence Validation

This library validates and corrects format grammar only.

It does not:

  • verify that a postcode is currently allocated
  • verify that an address is deliverable
  • query any external dataset/API

Why out of scope:

  • keeps behaviour deterministic and offline
  • avoids depending on allocation data that can go stale or vary by jurisdiction
  • maintains cross-language portability of the core algorithm

Examples

1) Valid input

$result = Cikmov::analyse('EC1A 1AL');
// inputWasValid: true
// bestCandidate: "EC1A 1AL"
// confidence: 100
// appliedPostcode: "EC1A 1AL"
// alternatives: []

2) Deterministic correction

$result = Cikmov::analyse('EC1A IAL');
// bestCandidate: "EC1A 1AL"
// confidence: 96
// appliedPostcode: "EC1A 1AL" (default threshold 85)

3) Ambiguous correction

$result = Cikmov::analyse('B01 8TH');
// bestCandidate: e.g. "BD1 8TH"
// alternatives: non-empty
// confidence: reduced because near competing candidates exist
// appliedPostcode may be null if confidence falls below threshold

4) Rejection

$result = Cikmov::analyse('!!!!');
// bestCandidate: null
// confidence: 0
// appliedPostcode: null

5) CIKMOV rejection

$result = Cikmov::analyse('EC1A 1AI');
// invalid due to forbidden inward letter I
// no correction is applied

6) Shifted-digit correction

$result = Cikmov::analyse('EC1A !AL');
// bestCandidate: "EC1A 1AL"
// confidence: 92
// appliedPostcode: "EC1A 1AL"

7) Shifted symbol in letter position is rejected

$result = Cikmov::analyse('EC1A 1A!');
// invalid: no shifted-digit substitution in letter-only positions
// bestCandidate: null
// appliedPostcode: null

Embedded Postcode Areas

The full area set is embedded and enforced:

AB AL B BA BB BD BF BH BL BN BR BS BT BX CA CB CF CH CM CO CR CT CV CW DA DD DE DG DH DL DN DT DY E EC EH EN EX FK FY G GL GU GY HA HD HG HP HR HS HU HX IG IM IP IV JE KA KT KW KY L LA LD LE LL LN LS LU M ME MK ML N NE NG NN NP NR NW OL OX PA PE PH PL PO PR RG RH RM S SA SE SG SK SL SM SN SO SP SR SS ST SW SY TA TD TF TN TQ TR TS TW UB W WA WC WD WF WN WR WS WV YO ZE
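
As a sketch of how the embedded list might be enforced, the example below extracts the leading letters of an outward code and checks them against the list. The constant and function names are hypothetical; the area codes are copied from the list above.

```php
<?php

// Area list copied verbatim from the README.
const AREAS = 'AB AL B BA BB BD BF BH BL BN BR BS BT BX CA CB CF CH CM CO CR CT CV CW '
    . 'DA DD DE DG DH DL DN DT DY E EC EH EN EX FK FY G GL GU GY HA HD HG HP HR HS HU HX '
    . 'IG IM IP IV JE KA KT KW KY L LA LD LE LL LN LS LU M ME MK ML N NE NG NN NP NR NW '
    . 'OL OX PA PE PH PL PO PR RG RH RM S SA SE SG SK SL SM SN SO SP SR SS ST SW SY '
    . 'TA TD TF TN TQ TR TS TW UB W WA WC WD WF WN WR WS WV YO ZE';

// The area prefix is the run of one or two letters before the first digit.
function areaPrefix(string $outward): ?string
{
    return preg_match('/^[A-Z]{1,2}(?=[0-9])/', $outward, $m) === 1 ? $m[0] : null;
}

function hasKnownArea(string $outward): bool
{
    $area = areaPrefix($outward);

    return $area !== null && in_array($area, explode(' ', AREAS), true);
}
```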

Testing

The PHPUnit suite covers:

  • grammar and normalization behaviour
  • correction behaviour and scoring outcomes
  • ambiguity and alternatives
  • AA9A positive/negative constraints
  • area enforcement
  • CIKMOV exclusion
  • invalid input rejection
  • GIR 0AA handling
  • Northern Ireland format handling
  • idempotency
  • Result invariants