detain / php-dup-finder
AST-based PHP duplicate-logic detector and refactoring assistant.
Requires
- php: ^8.1
- ext-gmp: *
- ext-hash: *
- nikic/php-parser: ^5.0
- sebastian/diff: ^5.1 || ^6.0
- sugarcraft/candy-core: dev-master
- sugarcraft/candy-hermit: dev-master
- sugarcraft/candy-kit: dev-master
- sugarcraft/candy-log: dev-master
- sugarcraft/candy-palette: dev-master
- sugarcraft/candy-sprinkles: dev-master
- sugarcraft/candy-zone: dev-master
- sugarcraft/sugar-bits: dev-master
- sugarcraft/sugar-charts: dev-master
- sugarcraft/sugar-crumbs: dev-master
- sugarcraft/sugar-prompt: dev-master
- sugarcraft/sugar-spark: dev-master
- sugarcraft/sugar-stickers: dev-master
- sugarcraft/sugar-table: dev-master
- sugarcraft/sugar-toast: dev-master
- sugarcraft/sugar-veil: dev-master
- symfony/console: ^6.4 || ^7.0
Requires (Dev)
- phpstan/phpstan: ^2.1
- phpunit/phpunit: ^10.5
- vimeo/psalm: 6.0.0
Suggests
- sugarcraft/candy-shell: spin/choose/filter/confirm CLI helpers for shell-script integration
- sugarcraft/sugar-wishlist: Example SSH endpoint picker app illustrating SugarCraft component usage
Versions
- dev-master
- v1.0.0
- dev-perf/ngram-cache
- dev-perf/ngram-index-optimizations
- dev-perf/ngram-int-postings-wired
- dev-perf/ngram-int-postings
- dev-perf/ngram-incremental-cache
- dev-perf/ngram-array-candidates
- dev-perf/ngram-index-cache
- dev-perf/ml-batch-scoring
- dev-perf/ml-client-prewarm
- dev-perf/yield-every-increase
- dev-perf/gc-frequency-tuning
- dev-perf/incremental-block-rangehash
- dev-feat/orm-dedup-option-6-ml-clusterer-wiring
- dev-feat/orm-dedup-option-5-ir-clusterer-wiring
- dev-feat/orm-dedup-option-6-ml-similarity
- dev-feat/orm-dedup-option-5-ir-lift
- dev-feat/orm-dedup-option-4-symbol-registry
- dev-feat/orm-dedup-option-3-behavioural-tags
- dev-feat/orm-dedup-option-2-trinity-collapse
- dev-feat/orm-dedup-option-1-db-canonical
- dev-docs/readme-rewrite
- dev-feat/benchmark-suite
- dev-fix/codacy-error-handling
- dev-feat/phar-slim-build
- dev-fix/codacy-unserialize-allowlist
- dev-fix/codacy-ml-ssrf
- dev-fix/codacy-server-security
- dev-docs/video-tapes
- dev-feat/html-hole-tweaker
- dev-docs/tutorials-adrs-algorithms
- dev-feat/graph-ml-scaffolding
- dev-feat/rest-api-server
- dev-feat/type4-fuzz-mutation
- dev-feat/memory-helpers
- dev-perf/token-cache
- dev-perf/bulk-ngram-hashing
- dev-feat/semantic-type-providers
- dev-feat/json-schema-spec
- dev-feat/comparative-bench
- dev-feat/architectural-analyzers
- dev-feat/refactor-patches-and-tests
- dev-feat/ted-weighted-edits
- dev-feat/minhash-lsh
- dev-feat/bloom-candidate-index
- dev-feat/cluster-coherence
- dev-feat/normalization-plugins
- dev-feat/signint-soft-cancel
- dev-feat/pattern-domain-tags
- dev-feat/framework-pattern-tags
- dev-feat/multi-seed-anti-unify
- dev-feat/php8-syntax-normalization
- dev-feat/persistent-cluster-cache
- dev-feat/pager-and-cluster-cache
- dev-feat/sdk-facade
- dev-feat/project-profiles
- dev-feat/graph-reporters
- dev-feat/union-type-inference
- dev-feat/safety-scoring
- dev-perf/two-tier-ted
- dev-feat/ci-templates
- dev-feat/golden-tests
- dev-feat/cli-verbosity
- dev-feat/per-directory-overrides
- dev-feat/auto-tune
- dev-feat/csv-prom-timeseries-reporters
- dev-feat/refactor-worker
This package is auto-updated.
Last update: 2026-05-11 05:51:47 UTC
README
A semantic clone detector and refactoring assistant for PHP codebases. Behaves more like an "extract function" advisor than a copy/paste finder.
phpdup parses every file in a PHP codebase into an Abstract Syntax
Tree, normalizes those ASTs into a canonical form, and finds clusters
of parameterizable duplication — places where the shape of the
code repeats and only literals, identifiers, method names, table
names, or whole optional segments of code vary.
For each cluster it doesn't just point at the duplicates, it tells you what the abstraction would look like — its parameter list, types, and a suggested function name — ready to drop into a refactor.
A run on tests/Fixtures returns its top 2 clusters by impact. Each
cluster's "Suggested abstraction" box is the function signature
phpdup is recommending you extract; the "Holes" table lists every
parameter with its inferred type and the values observed across cluster
members. Classic copy/paste detectors would only highlight the
duplication; phpdup tells you that the threshold and the role string
are the parameters of the abstraction, with their inferred types and
observed values, ready to apply.
Table of contents
- Features
- Installation
- Self-update
- Quick start
- How it works
- Type-3 / optional-segment detection
- Type-4 / behavioural similarity
- ORM- / DB-aware semantic deduplication
- TUI mode
- Watch mode
- SIGINT soft-cancel
- REST API server (phpdup serve)
- Output formats
- Configuration
- CLI reference
- Programmatic use
- Examples
- Static analysis & config validation
- Benchmarks
- Architecture
- Testing
- Performance
- Roadmap
- FAQ
- Contributing
- License
Features
- Semantic, not textual. Compares AST structure, not source text — so whitespace, comments, and identifier renames don't fool it.
- All four clone types. Type-1 (exact), type-2 (renamed variables / literals), type-3 (statements present in some members but absent from others — see Type-3 / optional-segment detection), and an experimental type-4 behavioural-similarity scorer that catches I/O-equivalent code with structurally different implementations (e.g. foreach-accumulator vs `array_reduce`, `switch` vs `match`). See Type-4 / behavioural similarity.
- Parameter discovery. For every cluster, identifies the literals, identifiers, method names, class names, and entire optional code segments that vary, and proposes them as parameters of a suggested abstraction with inferred types and named placeholders.
- Three normalization modes. From `strict` (variable-rename tolerant only) to `aggressive` (also collapses literal values, method names, property names, and class names) — pick the precision/recall trade-off you want.
- Three-phase clustering.
  - Hash buckets for exact canonical matches — O(N).
  - N-gram inverted index + Jaccard + APTED tree-edit-distance for near-duplicates — never quadratic in practice.
  - Containment fallback for type-3 clones whose Jaccard would fail because one block is a near-subset of another.
- Anti-unification with statement-array LCS. For every cluster phpdup computes the most-specific generalization of its members, using LCS over per-statement structural hashes when stmt arrays differ in length. Disagreements become typed parameter holes; per-statement gaps become defaulted boolean parameters.
- Pattern recognition (22 tags). Tags clusters that match well-known refactor archetypes — structural, domain, framework:
  - Structural: `sql-builder` · `crud-handler` · `validation-chain` · `strategy` · `config-driven` · `state-machine` · `optional-segments`.
  - Domain (I.A.2-7): `loop-map` (foreach + accumulator append) · `loop-filter` (foreach + leading guard-continue) · `sql-query` (literal SQL string in body) · `http-call` (Guzzle / cURL / `wp_remote_*` shapes) · `error-handler` (try-catch with logger or rethrow) · `builder-chain` (≥3 method-call chain) · `container-registration` (DI binders) · `db-op` (DB operation shape — `__DB_READ__` / `__DB_WRITE__` / `__DB_UPSERT__` tokens produced by `--db-aware` canonicalisation).
  - Framework (IX.A): `controller-action` (Laravel/Symfony controllers) · `migration` (Laravel/Doctrine migrations) · `eloquent-model` (`App\Models\…`) · `repository-method` (`*Repository::find/get/save…`) · `event-listener` · `service-provider` (Laravel SP / Symfony Bundle) · `query-builder-chain` (Doctrine/Eloquent QB entry points).
- Architectural analyzers (IX.B). Beyond the pattern tags, three analyzers post-process every cluster and emit `architectural_findings[]` with severity + remedial suggestion:
  - SOLID — flags SRP-mixed bodies (persistence + side-effect calls in one block) and DIP-violating concrete-class-string holes.
  - Design pattern — recognises Strategy / Factory / Builder shapes from existing tags + hole types.
  - Anti-pattern — Long Parameter List (>5 holes), Primitive Obsession (all holes scalar primitives).
- Cluster coherence (VI.A.3). Per-cluster outlier detection via mean pairwise n-gram Jaccard; flags members that don't belong with the rest. Surfaces as `outlier_members[]` in JSON and a ⚠ marker in CLI.
- Refactor-safety scoring (VI.A.1). Combines hole type-safety, cross-namespace span, member count, and pattern-tag deltas into a `[0,1]` score. `--min-safety` filters clusters that look risky to mechanically extract.
- Impact-ranked output. Clusters sorted by how many lines disappear if the abstraction is applied, with a separate confidence score that flags risky refactors (subtree-level holes, cross-namespace spans).
- Twelve output formats.
  - SugarCraft-styled colorized CLI (with a `--plain` switch and `--summary-only` / `--clusters` verbosity modes).
  - Structured JSON (machine-readable, full cluster + hole metadata including `present_in_members[]` for type-3 holes, `outlier_members[]`, `architectural_findings[]`, `safety` score; carries a `schema_version` field — docs/JETBRAINS_PLUGIN.md is the stable contract for IDE plugins).
  - Interactive HTML site (sortable/filterable index, mini-map of cluster impact, copy-signature buttons, syntax-highlighted code, optional-segment rows tinted amber, per-hole tweaker UI with `localStorage` persistence + JSON export).
  - SARIF 2.1.0 (GitHub Code Scanning / GitLab Code Quality, with grouping fingerprints + `optionalSegmentCount`).
  - GitLab SAST v15.x (MR security widget, severity-bucketed by impact).
  - Unified diffs per cluster + cumulative `--patch` file.
  - Checkstyle XML (Jenkins / Sonar / Bitbucket consumers).
  - CSV (`--csv=FILE`) — one row per cluster member with similarity / impact / safety / signature for spreadsheets and BI ingest.
  - Prometheus text-format (`--prometheus=FILE`) — `phpdup_clusters_total`, `phpdup_total_impact`, per-tag counters; ready for pushgateway scraping.
  - Time-series JSONL (`--timeseries=FILE`) — one append-only line per run, commit-tagged via `GIT_COMMIT` / `GITHUB_SHA` / `CI_COMMIT_SHA`; lets you track duplicate debt over time in BigQuery / ClickHouse / Elastic.
  - Graphviz DOT (`--graphviz=FILE`) — file→cluster bipartite graph; render with `dot -Tpng …`.
  - PlantUML (`--plantuml=FILE`) — class diagram with cluster packages and pattern-tag stereotypes.
  - Refactor patches (`--refactor-patch=DIR`) — heuristic, manual-review-required `.patch` per cluster: adds a `Refactored/<id>.php` skeleton plus per-member edit hints. Bails to a manual-review header when `$this` / `self::` / `yield` / closure capture would make mechanical replacement unsafe.
  - PHPUnit test skeletons (`--refactor-tests=DIR`) — one `markTestIncomplete()` test class per cluster with a data provider populated from observed hole values.
- Optional SugarCraft TUI (`--tui`). Four-pane FlexBox dashboard driven by the cooperative pipeline — counts, sparkline, OSC 9;4 taskbar progress all build up live frame-by-frame as work progresses. Six themes, full keyboard, ←/→ to cycle clusters in detail view.
- Watch mode (`--watch`). Re-runs analysis on file changes via a poll-based `React\EventLoop` timer; `Ctrl+C` exits cleanly. Combines with `--tui` for a live dashboard that resets and rebuilds on every change.
- SIGINT soft-cancel. First `Ctrl+C` flips `PipelineState::$cancelled`; cooperative stages check it between yields, short-circuit to the Reporting stage, and produce a partial report. Second `Ctrl+C` falls back to the default kill. Exit code is the canonical 130.
- REST API server (`phpdup serve`). Minimal HTTP service for in-house dashboards, CI integrations, and the playground front-end. Routes: `GET /healthz`, `POST /analyze` (sync), `POST /jobs` + `GET /jobs/{id}` (async-shaped). Hand-rolled HTTP/1.1 parser — no ReactPHP dep. SSRF guard, Content-Length cap, JSON_THROW_ON_ERROR, default bind to `127.0.0.1`. See docs/SERVER.md.
- PHAR distribution. `phpdup.phar` is published with every release plus a sha256 sidecar; download once, run anywhere with PHP 8.1+ and `ext-phar`. `phpdup self-update` (aliases `update` / `upgrade`) replaces the running binary in place after hash verification. Build locally with `php -d phar.readonly=0 build-phar.php`. See docs/PHAR.md.
- Block-kind filter (`--kinds=method,closure`). Drops non-matching block kinds at extraction time so clustering only sees what you asked for.
--kinds=method,closure). Drops non-matching block kinds at extraction time so clustering only sees what you asked for. - PHP 8.x syntax-surface canonicalisation.
matcharms collapsed to switch-equivalent token shape, named arguments reordered to lexicographic order, attributes stripped inaggressivemode — so syntactic-only differences across PHP 8.0 → 8.4 don't fool clustering. - Auto-tune (
--auto-tune). Probes the corpus before analysis and picks size-appropriate defaults: tiny (<200 files) gets relaxed thresholds, medium (<20k) tightensmax-df, large (≥20k) forces--exact-onlyto keep memory in check. Explicit CLI flags always override the picks. - Project profiles.
profiles/{laravel,symfony,drupal, wordpress,myadmin,generic}.jsonship preset config.--profile=NAMEselects one explicitly; auto-detect kicks in via project-marker files (artisan,bin/console,wp-config.php,include/Orm, etc.) when no explicit config is given. - Per-directory config overrides.
.phpdup.jsonfiles discovered at any depth in the scan tree, validated against the schema, layered ancestor-first so deeper directories override parents. Symbols (db_symbols.methodsanddb_symbols.functions) are additive — per-directory symbols append to root-config symbols rather than replacing them, allowing incremental extension without full duplication. - User-defined normalization plugins. Implement
Phpdup\Normalization\NormalizationPlugin, register the FQCN inphpdup.json -> normalization.plugins[], and your plugin runs after the built-in passes for project-specific canonicalisation (e.g. unify SDK alias methods, project-specific identifier normalisation). - Schema-validated config.
phpdup.jsonis checked againstdocs/config-schema.jsonat load time;--validate-configexits with the field path on the first violation before any analysis runs. - Shell completion for bash, fish, and zsh via
phpdup completion <shell>. Output is the standard Symfony Console completion script with commented-out installation instructions prepended, so you can paste-and-follow inline. - Composable pipeline with cooperative iteration. Five stages
(
Scanning,Preprocessing,Clustering,Refactoring,Reporting) all implement a tinyStageInterfaceand share aPipelineState; cooperative stages additionally yield mid-execution so the TUI can repaint while parallel work is in flight. AProgressListenerinterface lets observers (the TUI, watchers) hook in without touching stages. - Streaming worker pool.
WorkerPool::runStreaming()returns a\Generatorthat yields each child-process result as soon as it arrives — multiplexing children's per-process socketpairs viastream_select— soPreprocessStagecan drive the dashboard live instead of blocking on the slowest worker. The classic collect-and-returnrun()is now a thin synchronous drain over the same code path. - Parallelized preprocessing, pair scoring, and refactor.
pcntl_forkworker pool batches files for parse + extract + normalize + fingerprint (PreprocessWorker); candidate pairs for Jaccard + tree-edit scoring (PairScoreWorker); and per-cluster anti-unification + tagging (RefactorWorker). Auto CPU detection, serial fallback when pcntl is unavailable. - APTED-style tree edit distance with weighted edits.
Zhang-Shasha forest-distance DP with heavy-path child ordering
and bounded early termination, plus a
--ted-weights={default,semantic}cost model that gives method calls cost 2.0, control flow 1.5, literals 0.5 — better proxy for behavioural similarity than unit costs. - Two-tier TED pre-filter. Before the O(n²) DP runs, a size-delta check + 64-bit (node-type, depth) shapelet sketch reject obviously-different pairs in ~10 ALU ops. See Two-tier TED pre-filter.
- Three candidate-pair indexes. `NgramInvertedIndex` (default, small / medium corpora), `BloomCandidateIndex` (drop-in replacement that holds a fixed-size 2 KiB filter per block — trades posting-list memory for O(n²) bit-overlap scoring on huge corpora), and `LshIndex` (MinHash signatures + 32-band / 4-row LSH — near-constant lookup time per block).
- External-sort streaming-clustering primitive. Disk-backed K-way merge sort over (key, payload) tuples — foundation for clustering corpora that don't fit in RAM.
- Incremental indexing. Per-file block snapshots keyed by content hash + parser version + config key. Editing one file leaves the other 999 snapshots intact.
- Persistent cluster cache (II.B.4). Snapshots the final cluster list to `<cacheDir>/clusters.idx` after every successful run. Re-running on an unchanged corpus skips Cluster + Refactor stages entirely. Wholesale invalidation on any block change.
- Lazy AST loading. Original ASTs are dropped after fingerprinting and reloaded on demand only for blocks that end up in clusters. RSS scales sub-linearly with corpus size.
- AST + token caches. SHA-1 keyed disk caches (`AstCache` for full Stmt[] and `TokenCache` for raw token streams), both versioned to the parser release so warm-cache runs skip parsing entirely.
- Comparative benchmark suite. `bench/run-all.sh` runs phpdup alongside phpcpd / pmd-cpd / jscpd / simian on a curated mix of real OSS corpora (Symfony Console, Laravel HTTP, PHPUnit, WordPress core) plus a synthetic-fuzz corpus with known ground truth — emits wall-time / RSS / cluster-count comparison plus precision/recall/F1 on the synthetic set. See Benchmarks and bench/feature-matrix.md.
- Memory ceiling. `--max-memory=MB` warns and suggests `--exact-only` if peak RSS exceeds the threshold mid-pipeline.
- `--stage` halt point. `--stage=clustering` runs the pipeline only up to (and including) clustering and stops — useful for debugging incremental cache hits or profiling individual stages.
- Production-ready PHP. Strict types throughout, PSR-4 autoloaded, PHPStan level 6 clean, Psalm errorLevel 6 clean (no baseline), 396+ PHPUnit tests across four suites (Unit, Integration, Golden snapshots, Fuzz detection-rate), requires PHP 8.1+.
Installation
PHAR (recommended)
The fastest way to try phpdup. Download the self-contained phpdup.phar
from the latest GitHub release, verify the SHA-256, and drop it on
your PATH:
curl -sSLO https://github.com/detain/php-dup-finder/releases/latest/download/phpdup.phar
curl -sSLO https://github.com/detain/php-dup-finder/releases/latest/download/phpdup.phar.sha256
sha256sum --check phpdup.phar.sha256
chmod +x phpdup.phar
sudo mv phpdup.phar /usr/local/bin/phpdup
Requirements: PHP 8.1+ with ext-phar. The phar is ~7 MB.
Once installed, keep it current with the built-in self-updater:
phpdup self-update           # download & swap the binary in place
phpdup self-update --dry-run # check what's available, change nothing
update and upgrade are aliases. See Self-update
for the full flow + offline notes.
Build the phar yourself (e.g. for a private fork):
composer install --no-dev --optimize-autoloader
php -d phar.readonly=0 build-phar.php
See docs/PHAR.md for the full distribution flow,
release process, and troubleshooting (phar.readonly, ext-phar on
shared hosting, etc.).
Via Composer
composer require --dev detain/php-dup-finder
vendor/bin/phpdup analyze src
The package isn't on Packagist yet — declare the GitHub repo manually:
{
"require-dev": {
"detain/php-dup-finder": "dev-master"
},
"repositories": [
{ "type": "vcs", "url": "https://github.com/detain/php-dup-finder" }
],
"minimum-stability": "dev",
"prefer-stable": true
}
From source
git clone https://github.com/detain/php-dup-finder.git
cd php-dup-finder
composer install
bin/phpdup analyze /path/to/your/code
Requirements:
- PHP 8.1 or newer
- ext-hash (for `xxh128`)
- ext-pcntl + ext-posix (optional — without them phpdup runs serially with no other change)
- ext-phar (for the phar distribution; not needed for source/composer)
- ext-curl (optional; preferred over streams for `MlClient` and the self-updater's HTTP fetches)
- Composer (for source / composer install only)
Self-update
When phpdup is installed as a phar, it can replace itself in place with the latest GitHub release:
phpdup self-update           # download, verify sha256, swap
phpdup self-update --dry-run # report the latest tag, change nothing
phpdup update                # alias
phpdup upgrade               # alias
The flow:
1. Resolves https://api.github.com/repos/detain/php-dup-finder/releases/latest.
2. Downloads `phpdup.phar` and `phpdup.phar.sha256` from the release assets to a temp dir.
3. Verifies the SHA-256 — refuses to swap on mismatch.
4. Renames the running binary to `phpdup.phar.old` (rollback safety), moves the new phar into place, and marks it executable.
5. Prints the new version.
Requires write access to wherever the phar lives (i.e. don't put it
in /usr/local/bin if you wouldn't sudo to update; install per-
user under ~/.local/bin instead). Falls back gracefully when
GitHub returns "no releases yet" ({"message":"Not Found"}) so a
fresh fork can still call --dry-run without erroring out.
Composer / source installs don't use self-update — pull and
re-install the usual way:
composer update detain/php-dup-finder
# or, for a source clone:
git pull && composer install
Quick start
Scan a directory and print the top duplicates (auto-parallelized):
bin/phpdup analyze src
Multiple directories with both reports:
bin/phpdup analyze src lib \
--json duplicates.json \
--html duplicates-report \
--min-impact 30
Use a config file for repeatable runs:
bin/phpdup analyze --config phpdup.json
Quick exact-clones-only pass for CI (very fast, ~6 s on a 3,300-block corpus):
bin/phpdup analyze src --exact-only --min-impact 50
Emit every CI-relevant format in one shot:
bin/phpdup analyze src \
--sarif phpdup.sarif \
--gitlab-sast phpdup.gitlab.json \
--diff ./phpdup-diffs \
--checkstyle phpdup.xml \
--json phpdup.json \
--html phpdup-report
Filter to one block kind and gate on impact:
bin/phpdup analyze src --kinds=method --min-impact=50 --exact-only
Live-reload while you refactor:
bin/phpdup analyze src --watch
Show the interactive dashboard while analysis runs:
bin/phpdup analyze src --tui --theme=dracula
How it works
Pipeline
flowchart TB
subgraph Scan["Scanning"]
A[Scanner walks paths] -->|absolute file paths| B[FileScanner]
end
subgraph Preprocess["Preprocessing — parallel via WorkerPool::runStreaming"]
C[AstParser + AstCache] --> D[BlockExtractor with --kinds filter]
D --> E[Normalizer<br/>strict / default / aggressive]
E --> F[SubtreeHasher + NgramFingerprint]
F --> G[(IndexStore<br/>per-file snapshot)]
end
subgraph Cluster["Clustering"]
H[BlockIndex] --> I[NgramInvertedIndex]
I -->|candidate pairs| J{Pair scoring}
J -->|hash-bucket: exact match| K[Edge weight = 1.0]
J -->|Jaccard >= threshold| L[APTED tree-edit-distance]
J -->|Jaccard < threshold<br/>+ optional_blocks_enabled| M[ContainmentSimilarity<br/>type-3 fallback]
L --> N[Edge weight = min<br/>jaccard, ted]
M --> O[Edge weight = containment]
K --> P[Union-find]
N --> P
O --> P
P --> Q[Clusters]
end
subgraph Refactor["Refactoring"]
R[AntiUnifier seed = max-size member] --> S{stmt arrays<br/>differ in length?}
S -->|yes + type-3 enabled| T[LCS on stmt hashes<br/>→ optional_block holes]
S -->|no| U[Recurse normally<br/>→ literal/identifier/name holes]
T --> V[ParameterSynthesizer]
U --> V
V --> W[bool $includeFooBar = false<br/>or typed required param]
W --> X[SignatureBuilder + PatternRecognizer]
end
subgraph Report["Reporting"]
Y[Ranker by impact] --> Z[CLI / JSON / HTML / SARIF<br/>GitLab SAST / Diff / Checkstyle]
end
B --> C
G --> H
Q --> R
X --> Y
%% Observer overlay
PL[ProgressListener<br/>e.g. PhpdupModel TUI] -.observes.-> Scan
PL -.observes.-> Preprocess
PL -.observes.-> Cluster
PL -.observes.-> Refactor
The whole pipeline is also driven cooperatively as a \Generator
(Pipeline::iter()) — each yield point is a chance for the TUI runtime
to repaint or for a watcher to inject a RestartPipelineMsg.
| Stage | Output |
|---|---|
| Scanning | absolute file paths (glob include/exclude) |
| Preprocessing | annotated blocks (canonical AST + n-gram bag + structural hash) per file |
| Clustering | clusters with similarity scores + edge weights |
| Refactoring | generalized AST, holes, signature, pattern tags |
| Reporting | CLI / JSON / HTML / SARIF / GitLab SAST / diff / Checkstyle output |
Normalization modes
| Mode | Variable rename | Literal collapse | Name collapse |
|---|---|---|---|
| strict | yes | no | no |
| default | yes | yes | no |
| aggressive | yes | yes | yes |
In aggressive mode (default), two functions with different table
names, different method names, and different literal values can still
cluster together.
Clustering
Three phases:
- Exact canonical clones. All blocks sharing the same Merkle hash over the canonical AST land in the same bucket. O(N) work.
- Near-duplicates. For each block, candidates are pulled from a rare-n-gram inverted index (ignoring n-grams that occur in more than `max_df` × N blocks). Each candidate is scored by Jaccard similarity on the canonical n-gram multiset; survivors are refined with APTED-style bounded tree-edit-distance.
- Type-3 fallback. When Jaccard fails but `ContainmentSimilarity` shows the smaller block is mostly contained in the larger (containment ≥ 0.85 AND size_ratio ≥ 0.6), the pair is accepted anyway. See Type-3 / optional-segment detection.
Surviving edges feed a union-find that merges them into clusters.
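The candidate-generation and Jaccard steps above can be sketched in a few lines. This is a language-agnostic illustration in Python (phpdup itself is PHP); the function names and data shapes here are invented for the sketch, not phpdup's actual API:

```python
from collections import Counter

def jaccard(a: Counter, b: Counter) -> float:
    """Multiset Jaccard: sum of per-key minima over sum of per-key maxima."""
    keys = a.keys() | b.keys()
    union = sum(max(a[k], b[k]) for k in keys)
    return sum(min(a[k], b[k]) for k in keys) / union if union else 0.0

def candidates(blocks: dict[str, Counter], max_df: float) -> set[frozenset]:
    """Rare-n-gram inverted index: grams occurring in more than
    max_df * N blocks are too common to be discriminative and are
    skipped; any two blocks sharing a surviving gram become a pair."""
    n = len(blocks)
    postings: dict[str, list[str]] = {}
    for name, grams in blocks.items():
        for g in grams:
            postings.setdefault(g, []).append(name)
    pairs: set[frozenset] = set()
    for names in postings.values():
        if len(names) > max_df * n:
            continue  # high document frequency: ignore this gram
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                pairs.add(frozenset((names[i], names[j])))
    return pairs
```

Only the pairs that survive this filter are scored with Jaccard and then refined with tree-edit distance, which is why the clustering phase stays sub-quadratic on real corpora.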
Anti-unification
For every cluster phpdup computes the most-specific generalization of its members. The classic recursion:
au(t1, t2) =
if root(t1) == root(t2) and arity matches:
Node(root(t1), [au(c1_i, c2_i) for i in children])
else:
Hole(observed=[t1, t2])
Extensions in phpdup:
- Seed = max-size member. Cluster member with the most AST nodes is used as the template, so the "maximal" version drives the abstraction and shorter members get optional segments highlighted (instead of the alignment failing because the seed happened to be short).
- LCS for stmt arrays. When two stmts/cases/catches arrays differ in length, phpdup runs LCS over per-statement structural hashes; matched positions recurse, unmatched template positions become `optional_block` holes.
The resulting template has Hole markers at every position where
members disagreed. Each hole tracks its observed values across all
members (in cluster order), so reports show
threshold ∈ {10, 20, 30} and
role ∈ {'admin', 'moderator', 'editor'}.
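The classic recursion above can be made concrete on toy trees. A minimal Python sketch, using nested tuples in place of real AST nodes (the node shapes and `HOLE` marker are inventions of this sketch):

```python
def anti_unify(t1, t2):
    """Most-specific generalization of two AST-like nested tuples.
    A node is ("label", child, child, ...); leaves are plain values.
    Positions where the trees disagree become ("HOLE", [observed])."""
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[:1] == t2[:1] and len(t1) == len(t2)):
        # Same root label and arity: recurse into the children.
        return (t1[0],) + tuple(anti_unify(a, b) for a, b in zip(t1[1:], t2[1:]))
    if t1 == t2:
        return t1
    return ("HOLE", [t1, t2])

# Two "if ($x > N) { grant(); }" shapes differing only in the literal:
a = ("if", ("gt", ("var", "x"), ("lit", 10)), ("call", "grant"))
b = ("if", ("gt", ("var", "x"), ("lit", 20)), ("call", "grant"))
```

Running `anti_unify(a, b)` leaves the shared structure intact and produces a single hole at the threshold literal, with both observed values recorded — exactly the information the Holes table reports per cluster.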
Pattern recognition
After anti-unification, each cluster is checked against a small catalog of refactor archetypes (sql-builder, crud-handler, validation-chain, strategy, config-driven, state-machine, optional-segments). Tags are advisory; they don't change clustering, just label the cluster in the report.
Ranking
Each cluster gets two scores:
- Impact ≈ (members − 1) × avgBlockSize − holesPenalty. How many lines of code disappear if the abstraction is applied.
- Confidence in [0,1]. Cluster similarity, penalized for subtree-level holes (large variable subtrees) and cross-namespace spans, bumped for same-class cohesion.
Clusters below min_cluster_impact are dropped. Survivors are sorted by
descending impact, breaking ties by member count and similarity.
Parallelism
Phpdup\Parallel\WorkerPool partitions a list of items into N batches,
forks one child per batch via pcntl_fork, runs the closure in the
child, and the parent reaps results. There are two collection modes:
- `run()` — collect-and-return. Each child writes its full result to a temp file when finished; the parent reads them all once every child has exited.
- `runStreaming()` — yield-as-results-arrive. Children write length-prefixed serialized records (4-byte big-endian uint32 + payload) to per-child `stream_socket_pair`s; the parent multiplexes via `stream_select` and the returned `\Generator` yields each record live. `PreprocessStage` consumes this so the cooperative pipeline gets mid-stage progress events instead of blocking until every fork exits.
Two phases use the pool:
- `PreprocessWorker` — each child does parse + extract + normalize + hash + n-gram fingerprint for its file batch.
- `PairScoreWorker` — once candidate pairs are generated from the inverted index, the master batches them across workers; each child runs Jaccard + bounded TED + (when enabled) the type-3 containment fallback on its batch and emits surviving edges.
CPU count is auto-detected (nproc / /proc/cpuinfo) or overridable
via --workers N / PHPDUP_WORKERS=N. When pcntl_* is unavailable
(Windows, sandboxed PHP), the pool detects this at runtime and falls
back to a serial code path with the same closure interface — callers
don't branch.
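The streaming wire format — a 4-byte big-endian uint32 length prefix followed by a serialized payload — is simple enough to sketch. Illustrative Python over a socketpair (`pickle` stands in for PHP's `serialize()`; the helper names are inventions of this sketch):

```python
import socket
import struct
import pickle

def send_record(sock: socket.socket, payload: object) -> None:
    """Frame one record: 4-byte big-endian uint32 length + body."""
    body = pickle.dumps(payload)
    sock.sendall(struct.pack(">I", len(body)) + body)

def _read_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("peer closed mid-record")
        buf += chunk
    return buf

def recv_record(sock: socket.socket) -> object:
    """Read exactly one framed record off the socket."""
    (length,) = struct.unpack(">I", _read_exact(sock, 4))
    return pickle.loads(_read_exact(sock, length))

parent, child = socket.socketpair()
send_record(child, {"file": "a.php", "blocks": 3})
```

The length prefix is what lets the parent multiplex many children with `select`-style readiness polling: as soon as a full frame is buffered, the record can be yielded without waiting for the child to exit.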
Incremental indexing
Phpdup\Persistence\IndexStore snapshots each file's extracted +
normalized + fingerprinted blocks under
<cache_dir>/<sha1(path)>.idx. Each snapshot stores:
- `file_hash` — `sha1_file()` of the source.
- `parser_version` — bumped together with the AST cache key.
- `config_key` — sha1 of the relevant config fields (block size, normalization mode, n-gram size). Changing any of these invalidates the snapshot automatically.
- `blocks` — serialized `Block[]` ready to pour into the index.
On re-runs the master splits files into "reuse" (snapshot hit) and "process" (snapshot miss) buckets and only the latter goes through the worker pool. Editing one file leaves the other snapshots intact.
Disable with --no-incremental for benchmarking or when paranoid
about cache poisoning.
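The reuse/process split follows mechanically from the snapshot fields. An illustrative Python sketch (the config field names `min_block_size`, `normalization_mode`, `ngram_size` are assumptions of this sketch, not necessarily phpdup's exact keys):

```python
import hashlib
import json

def config_key(config: dict) -> str:
    """sha1 over only the snapshot-relevant config fields, so unrelated
    config changes don't invalidate every snapshot."""
    relevant = {k: config[k] for k in ("min_block_size", "normalization_mode", "ngram_size")}
    return hashlib.sha1(json.dumps(relevant, sort_keys=True).encode()).hexdigest()

def split_files(files, snapshots, hash_of, parser_version, cfg_key):
    """Partition files into snapshot hits (reuse) and misses (process);
    only the misses go through the worker pool."""
    reuse, process = [], []
    for f in files:
        snap = snapshots.get(f)
        if (snap and snap["file_hash"] == hash_of(f)
                and snap["parser_version"] == parser_version
                and snap["config_key"] == cfg_key):
            reuse.append(f)
        else:
            process.append(f)
    return reuse, process
```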
Lazy AST loading
After fingerprinting we drop Block::$ast (the original PhpParser
subtree) and reload it on demand inside AntiUnifier via
BlockAstLoader. The loader walks the file's parse-cached statement
list looking for the unique
(kind, start_line, end_line, declared_name) tuple; matches are
populated back into the Block.
The AST cache is consulted first so on warm runs no parsing happens at
all. Disable with --no-lazy-ast if you have RAM to spare and want
maximum speed (the reload overhead is roughly equal to the RSS savings
on small corpora — see BENCHMARKS.md).
Type-3 / optional-segment detection
A "type-3" clone is one where the structures match, but some members have extra (or missing) statements relative to others. phpdup detects these in two coordinated stages.
The fixture run shows phpdup pulling in two blocks whose statements
share a common prefix but where the longer block has two extra calls
(some_other_logic($here) and and_more($f)). The "Suggested
abstraction" box ends up with two defaulted boolean parameters
named after the absent code:
function extractedFunction(
    bool $includeSomeOtherLogic = false,
    bool $includeAndMore = false,
): mixed
…and the cluster is tagged optional-segments. Each row in the Holes
table is marked optional_block with the literal <absent> sentinel
showing which members lacked the segment.
Clustering: containment fallback
When the n-gram Jaccard between two candidate blocks falls below
similarity_threshold, the clusterer tries

ContainmentSimilarity = |A ∩ B|_min / min(|A|, |B|)

where |A ∩ B|_min is the multiset-intersection size (element-wise
minimum of n-gram counts) and |A|, |B| are the bag sizes. This returns
1.0 whenever the smaller bag is fully contained in the larger,
regardless of size disparity. The pair is accepted with the
containment score as the edge weight only when:

containment ≥ optional_blocks_containment (default 0.85) AND
size_ratio ≥ optional_blocks_min_overlap (default 0.6)
The size-ratio guard prevents a 1-line block from clustering with a 100-line block on a single shared n-gram.
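Both the containment score and the size-ratio guard fit in a few lines. An illustrative Python sketch over n-gram bags as `Counter` multisets (function names are invented for the sketch):

```python
from collections import Counter

def containment(a: Counter, b: Counter) -> float:
    """Multiset-intersection size over the smaller bag's size: returns
    1.0 whenever the smaller bag is fully inside the larger."""
    inter = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    return inter / min(sum(a.values()), sum(b.values()))

def accept_type3(a: Counter, b: Counter,
                 containment_min: float = 0.85,
                 min_overlap: float = 0.6) -> bool:
    """Containment gate plus the size-ratio guard: a tiny block must not
    cluster with a huge one just because it is trivially contained."""
    small, large = sorted((sum(a.values()), sum(b.values())))
    return containment(a, b) >= containment_min and small / large >= min_overlap
```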
Anti-unification: LCS over statement arrays
The seed (template) is the cluster member with the most AST nodes.
When walk() reaches a stmts / cases / catches array whose
length differs from the seed's, phpdup runs LCS over each statement's
structural hash:
- Matched template positions recurse via the normal walk — variables, literals, names inside the matched statement still produce regular holes.
- Unmatched template positions become `optional_block` holes — one per missing statement, capped at `optional_blocks_max_per_cluster` (default 3) so over-flexible clusters don't explode into seven-boolean signatures.
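The alignment step can be sketched with Python's `difflib` (whose matcher approximates an LCS alignment; the function and cap parameter here are inventions of this sketch):

```python
from difflib import SequenceMatcher

def align_stmts(seed_hashes: list[str], member_hashes: list[str],
                max_optional: int = 3):
    """Align per-statement structural hashes of the seed against a
    shorter member. Matched seed positions recurse normally; unmatched
    seed positions become optional_block holes, capped at max_optional."""
    sm = SequenceMatcher(a=seed_hashes, b=member_hashes, autojunk=False)
    matched: set[int] = set()
    for blk in sm.get_matching_blocks():
        matched.update(range(blk.a, blk.a + blk.size))
    optional = [i for i in range(len(seed_hashes)) if i not in matched]
    return matched, optional[:max_optional]
```

For a seed of four statements where the member only has the first and third, positions 1 and 3 come back as optional-block candidates.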
Each optional_block hole becomes a default-false bool parameter.
The name is derived from the first non-stop-word identifier in the
segment, e.g. a missing some_other_logic($here); becomes
bool $includeSomeOtherLogic = false. SignatureBuilder groups
required parameters first then defaulted booleans so the resulting
signature is syntactically valid PHP.
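The name derivation can be illustrated with a small helper. The stop-word list and camel-casing rules here are assumptions for illustration, not the real SignatureBuilder internals:

```php
<?php
// Derive a bool-parameter name from the first non-stop-word identifier
// in a missing segment, e.g. "some_other_logic($here);" becomes
// "$includeSomeOtherLogic". Stop-word list is an assumed example.
function optionalParamName(string $segment): string
{
    $stopWords = ['if', 'return', 'echo', 'new', 'this'];
    preg_match_all('/[A-Za-z_][A-Za-z0-9_]*/', $segment, $m);
    foreach ($m[0] as $ident) {
        if (in_array(strtolower($ident), $stopWords, true)) {
            continue;
        }
        // snake_case -> CamelCase, prefixed with "include".
        $camel = str_replace(' ', '', ucwords(str_replace('_', ' ', $ident)));
        return '$include' . $camel;
    }
    return '$includeSegment'; // fallback when no identifier is found
}
```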
Tunables
{
"optional_blocks": {
"enabled": true,
"containment": 0.85,
"min_overlap": 0.6,
"max_per_cluster": 3,
"min_segment_length": 1
}
}
CLI overrides:
bin/phpdup analyze src \
--optional-blocks=on \
--optional-blocks-containment=0.85
To disable type-3 detection completely: optional_blocks.enabled = false (or --optional-blocks=off); the clusterer reverts to
Jaccard-only and AntiUnifier falls back to whole-array subtree holes
when stmt arrays differ in length.
How it surfaces in reporters
- CLI — the Holes table shows kind = optional_block; observed values
include the literal <absent> marker for members where the segment was
missing.
- JSON — every optional_block hole carries
present_in_members: [int, ...] listing the cluster-member indices
that did include the segment.
- SARIF — each result's properties adds optionalSegmentCount and
hasOptionalSegments so PR-annotation tooling can flag type-3
clusters distinctly.
- HTML — optional rows are tinted amber, get a "type-3" badge, and
italicize the <absent> sentinel.
Type-4 / behavioural similarity (experimental)
Type-1/2/3 detection compares shape — what tokens the AST emits.
Type-4 detection compares behaviour — what data flows through
the block, what calls it makes, what it returns. Two functions that
compute the same value with structurally different implementations
(foreach accumulator vs array_reduce, recursion vs iteration,
switch vs match) cluster under type-4 even when AST similarity
rejects them.
phpdup's type-4 scaffold:
- Phpdup\Semantic\DataflowSummarizer — walks the block once and emits
(vars, calls, returns, sideEffects).
- Phpdup\Semantic\DbOperationTagger — separately walks the block and
emits a tag → count multiset of recognised database operations
(db.read, db.write, db.delete, db.execute, db.query). Library- and
extension-agnostic by design — see Behavioural-tag scoring below.
- Phpdup\Similarity\BehaviouralSimilarity — weighted Jaccard over
those summaries (var-set 1×, call-multiset 2×, return-shape 2×,
side-effect-flag 1×, DB-op-tag multiset 2×) → a [0,1] score.
- Phpdup\Semantic\CallGraph and Phpdup\Semantic\ControlFlowGraph —
coarse per-block summaries used by future type-4 boost paths.
Type-4 has higher false-positive risk than type-1/2/3 by design.
Today it's a foundational scaffold; the live wiring into the
clusterer (as a fourth-tier fallback after Jaccard + APTED +
containment all reject) is gated behind a future --type4 flag.
See docs/algorithms/anti-unification.md
for the algorithmic reference and
docs/plans/orm-db-semantic-dedup.md
for how this connects to the broader plan for ORM/DB
semantic-equivalence detection.
ORM- / DB-aware semantic deduplication
Stock AST clustering misses duplicates that the developer thinks of as
"the same operation" because the surface call shape is different —
typically when the same database write/read is expressed via an ORM
in one place and via raw SQL in another. The four examples below all
perform the same write, but none cluster without --db-aware:
// Eloquent
$user = User::find($id);
$user->name = 'Bob';
$user->save();

// Doctrine
$user = $em->find(User::class, $id);
$user->setName('Bob');
$em->flush();

// Raw PDO
$pdo->query("UPDATE users SET name = 'Bob' WHERE id = {$id}");

// Query builder
$db->table('users')->where('id', $id)->update(['name' => 'Bob']);
Pass --db-aware and phpdup runs Phpdup\Normalization\DbOpCanonicalizer
as a pre-pass during normalisation. It rewrites recognised database
calls into canonical synthetic FuncCalls — __DB_FIND__("user"),
__DB_QUERY__("users", "SELECT"), __DB_WRITE__("users") etc. —
so equivalent variants produce identical token streams and cluster
together in tier-1 / tier-2.
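Conceptually the canonicaliser is a dispatch table from recognised call shapes to synthetic tokens. A toy sketch over (receiver, method) strings; the real DbOpCanonicalizer rewrites AST nodes and its table is far larger:

```php
<?php
// Toy canonicalisation table: recognised (class, method) pairs fold to
// synthetic __DB_<OP>__ tokens so different variants tokenise identically.
// Illustrative only; entries here are a tiny subset of the stock registry.
function canonicalToken(string $receiver, string $method): ?string
{
    $table = [
        'User::find'          => '__DB_FIND__',
        'EntityManager::find' => '__DB_FIND__',
        'PDO::query'          => '__DB_QUERY__',
        'Builder::update'     => '__DB_WRITE__',
    ];
    return $table["{$receiver}::{$method}"] ?? null;
}
```

Because both `User::find` and `EntityManager::find` fold to `__DB_FIND__`, the Eloquent and Doctrine variants above emit the same token at that position.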
What is recognised
The stock symbol table in Phpdup\Normalization\DbOpRegistry covers:
- Eloquent / Laravel — Model::find, Model::all, Model::create,
DB::table, DB::select, Model::where(...)->first|get|update|delete(),
etc.
- Doctrine ORM — EntityManager::find, EntityManager::flush,
EntityManager::persist, Repository::findOneBy, Repository::findAll.
- PDO — PDO::query, PDO::prepare, PDOStatement::execute,
PDOStatement::fetch*.
- mysqli (object + procedural) — mysqli::query, mysqli_query,
mysqli_stmt_execute, mysqli_fetch_*.
- PostgreSQL — pg_query, pg_query_params, pg_fetch_*, pg_insert,
pg_update, pg_delete.
- Firebird / InterBase — ibase_query, ibase_prepare, ibase_execute,
ibase_fetch_row, ibase_fetch_assoc, ibase_fetch_object,
ibase_commit, ibase_rollback.
- MSSQL / DB-Library — mssql_query, mssql_fetch_row,
mssql_fetch_array, mssql_fetch_assoc, mssql_fetch_object,
mssql_num_rows.
- IBM DB2 — db2_prepare, db2_execute, db2_query, db2_fetch_row,
db2_fetch_assoc, db2_fetch_array, db2_fetch_object, db2_num_rows.
- Async MySQL clients — amphp/mysql (amysql_query,
amysql_fetch_assoc, amysql_fetch_row, amysql_free_result),
friends-of-reactphp/mysql (react_mysql_query,
react_mysql_fetch_assoc, react_mysql_fetch_row), Swoole Coroutine
MySQL (swoole_mysql_query, swoole_mysql_fetch_assoc,
swoole_mysql_fetch_row), OpenSwoole MySQL (openswoole_mysql_query),
Workerman mysql (workerman_mysql_query).
- Async PostgreSQL clients — amphp/postgres (apg_query,
apg_fetch_assoc, apg_fetch_row, apg_free_result), reactphp/postgres
(react_pg_query, react_pg_fetch_assoc, react_pg_fetch_row), Swoole
coroutine postgres (swoole_postgres_query).
- Oracle OCI8 — oci_parse, oci_execute, oci_fetch, oci_fetch_assoc,
oci_fetch_row, oci_fetch_object, oci_free_statement, oci_commit,
oci_rollback.
- Cassandra (phpcassa) — phpcassa_query, phpcassa_fetch.
- LevelDB — leveldb_get, leveldb_put, leveldb_delete, leveldb_open.
- Memcached — memcached_get, memcached_set, memcached_add,
memcached_replace, memcached_delete, memcached_increment,
memcached_decrement, memcached_flush.
- Async transaction methods — beginTransaction, commit, rollback are
classified OP_WRITE; auxiliary methods affectedRows → OP_READ,
insertId → OP_WRITE, count → OP_READ.
- Generic CRUD verbs — any method named find, findById, save, update,
delete, query, execute on an unknown receiver — coarse but
high-recall. Also: table, select, insert, upsert, lock, unlock via
the illuminate/database query builder facade (Laravel
DB::table()->... chains).
- Raw SQL strings — passed to any of the above. The bundled
Phpdup\Normalization\SqlTableExtractor lifts the verb
(SELECT/INSERT/UPDATE/DELETE/REPLACE/TRUNCATE) and the primary table
out of the literal string so the verb shows up as a token in the
synthesised __DB_<OP>__ call.
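The verb/table lift from a literal SQL string can be approximated with one regex pass per verb. A sketch only; the bundled SqlTableExtractor is assumed to handle quoting, schema prefixes, and messier SQL:

```php
<?php
// Lift the SQL verb and primary table from a literal query string.
// Returns [verb, table] or null when no verb pattern matches.
function extractSqlOp(string $sql): ?array
{
    $patterns = [
        '/^\s*SELECT\b.*?\bFROM\s+`?(\w+)`?/is'         => 'SELECT',
        '/^\s*INSERT\s+INTO\s+`?(\w+)`?/is'             => 'INSERT',
        '/^\s*UPDATE\s+`?(\w+)`?/is'                    => 'UPDATE',
        '/^\s*DELETE\s+FROM\s+`?(\w+)`?/is'             => 'DELETE',
        '/^\s*TRUNCATE\s+(?:TABLE\s+)?`?(\w+)`?/is'     => 'TRUNCATE',
    ];
    foreach ($patterns as $re => $verb) {
        if (preg_match($re, $sql, $m)) {
            return [$verb, strtolower($m[1])];
        }
    }
    return null;
}
```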
How it slots into the pipeline
1. ScanningStage
2. PreprocessStage
a. AstParser → plain AST
b. BlockExtractor → (file, kind, range)
c. Normalizer
├── DbOpCanonicalizer ← only when --db-aware
└── CanonicalizingVisitor (vars, literals, names)
d. SubtreeHasher + NgramFingerprint
3. ClusterStage
4. RefactorStage
5. ReportStage
DbOpCanonicalizer runs before the standard variable / literal /
name passes so the synthesised __DB_<OP>__ token names survive the
aggressive name-canonicalisation pass (they're treated as structural
function names, like isset or count).
Risk profile and false positives
--db-aware is intentionally biased toward high recall — a few
benign false positives (e.g. an unrelated query() method on a
non-DB class folding to __DB_QUERY__) are cheaper than missing
real ORM ↔ raw-SQL clones. Two safeguards:
- Off by default. Tier-1 AST-only clustering remains the unmodified
path; nothing in the existing report stream changes without
--db-aware.
- Cache invalidated on toggle.
Phpdup\Pipeline\Stages\PreprocessStage includes dbAware in its
config-key hash, so flipping the flag between runs reprocesses every
file rather than reusing stale-canonicalised blocks.
For per-project tuning, the DbOpRegistry constructor accepts
customMethodOps and customFunctionOps maps so you can extend or
override the stock dispatch table — wire that into a normalisation
plugin (see Normalisation plugins) for
project-specific rewrites.
Trinity-collapse — --trinity-collapse
--db-aware folds individual DB calls. The natural follow-up is
trinity-collapse: detecting the canonical CRUD shape
$user = User::find($id);   // (1) read
$user->name = 'Bob';       // (2) mutate
$user->save();             // (3) save
…and rewriting the three statements as a single
__DB_UPSERT__("user") synthetic call so the ORM idiom clusters
with the raw equivalent
$pdo->query("UPDATE users SET name = 'Bob' WHERE id = $id");
(which --db-aware separately folds to a __DB_WRITE__ token).
Phpdup\Normalization\TrinityCollapser walks every statement-array
in the AST (function/method/closure bodies, if/else branches, loop
bodies) and looks for triplets of the shape read → mutate(s) → save
where:
- The read is any DbOpRegistry::OP_READ call assigned to a variable
($x = User::find($id), $x = $em->find(User::class, $id),
$x = $repo->findOneBy([...]), …).
- The mutate(s) are property assignments ($x->name = 'Bob',
$x->name .= 'X') or setter calls ($x->setName('Bob'),
$x->withFoo(...), $x->addThing(...)) on the bound variable. At least
one mutation is required — read+save with no change in between is
left untouched.
- The save terminates the chain: receiver-bound writes ($x->save(),
$x->update()), Doctrine flush ($em->flush()), or $em->persist($x)
(with $x matching the bound variable as the first argument).
Mutation detection improvements. The collapser recognises additional mutation patterns beyond simple property assignment:
- ArrayAccess / array assignment — $x['key'] = value via
Node\Expr\ArrayDimFetch on the bound variable.
- Direct array append — $x[] = value on the bound variable.
- Extended ORM mutator prefixes — setter-method prefixes force,
update, change, modify are recognised in addition to the base
set/with/add/remove/append/replace set, so idioms like
$x->forceUpdate(...) or $x->changeName(...) are correctly classified
as mutations.
Any unrelated statement between the read and save abandons the trinity — the dataflow walker is intentionally conservative; false misses are preferable to false matches.
--trinity-collapse composes with --db-aware (the typical
combination), and runs as a pre-pass before DbOpCanonicalizer
so the collapser can pattern-match on the original read/save call
shapes.
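The collapser's core is a conservative scan over each statement list. A simplified model using pre-tagged statements (the real walker matches PhpParser nodes; the kinds and tuple shape here are illustrative):

```php
<?php
// Each statement is modelled as [kind, boundVar] with kinds
// 'read', 'mutate', 'save', 'other'. Returns true only when the list
// is exactly read -> mutate(s) -> save on one bound variable.
function isTrinity(array $stmts): bool
{
    if (count($stmts) < 3 || $stmts[0][0] !== 'read') {
        return false;
    }
    $var = $stmts[0][1];
    $mutations = 0;
    for ($i = 1; $i < count($stmts) - 1; $i++) {
        [$kind, $v] = $stmts[$i];
        if ($kind !== 'mutate' || $v !== $var) {
            return false; // any unrelated statement abandons the trinity
        }
        $mutations++;
    }
    $last = $stmts[count($stmts) - 1];
    return $mutations >= 1 && $last[0] === 'save' && $last[1] === $var;
}
```

Note how an unrelated statement anywhere between read and save returns false, matching the "false misses are preferable to false matches" stance above.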
Behavioural-tag scoring
The Type-4 behavioural scorer (Phpdup\Similarity\BehaviouralSimilarity)
gains a fifth band: a DB op-tag multiset Jaccard weighted equally
with the existing call-name and return-shape bands.
Phpdup\Semantic\DbOperationTagger walks each block once and produces
a coarse tag → count summary:
['db.read' => 2, 'db.write' => 1, 'db.execute' => 1]
Two functions with the same DB-shape — same number of reads, writes,
deletes, executes, and queries — score similarly under the tag band
regardless of which library or extension delivers each operation.
That collapses the library/extension axis (Eloquent vs Doctrine vs
PDO vs mysqli vs pg_*) without needing per-pair surface analysis.
The tag band is a no-op for blocks that do not touch the database
(empty bag vs empty bag is 1.0 by convention) so non-DB code is
unaffected. The behavioural scorer's total weight is now 1 + 2 + 2 + 1 + 2 = 8 (variables + calls + returns + side-effects + DB tags),
normalised back to [0, 1].
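The tag band itself is a multiset Jaccard with the empty-vs-empty convention made explicit. A sketch over the tag → count arrays shown above (not the actual BehaviouralSimilarity code):

```php
<?php
// Multiset Jaccard over DB op-tag bags. Two empty bags score 1.0 by
// convention so the band is a no-op for non-DB code.
function tagBandScore(array $a, array $b): float
{
    if (!$a && !$b) {
        return 1.0;
    }
    $inter = 0;
    $union = 0;
    foreach (array_unique(array_merge(array_keys($a), array_keys($b))) as $tag) {
        $inter += min($a[$tag] ?? 0, $b[$tag] ?? 0);
        $union += max($a[$tag] ?? 0, $b[$tag] ?? 0);
    }
    return $union > 0 ? $inter / $union : 1.0;
}
```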
The tagger reuses the same DbOpRegistry as DbOpCanonicalizer, so
the recognised call set is consistent across the canonicalisation
and scoring layers, and synthetic __DB_<OP>__ calls produced by
--db-aware / --trinity-collapse are also tagged correctly.
Symbol equivalence classes
The stock DbOpRegistry covers the obvious surface area but every
codebase has wrappers, helpers, and homegrown façades that do DB
work without using a recognised name. Option 4 of the plan
exposes a user-extensible symbol equivalence registry: declare in
phpdup.json (or in a profile JSON) that app_db_get,
MyRepo::lookup, and LegacyDb::raw all mean the same thing as
the stock entries — and they fold to the same canonical
__DB_<OP>__ tokens.
{
"db_aware": true,
"db_symbols": {
"methods": {
"lookup": "db.read",
"persistMe": "db.write",
"wipe": "db.delete"
},
"functions": {
"app_db_get": "db.read",
"app_db_query": "db.query"
}
}
}
Allowed canonical ops: db.read, db.write, db.delete,
db.execute, db.query. Custom entries override stock ones with
the same name; everything else is additive.
Bundled symbol packs. Twenty-nine framework-flavoured packs ship
out of the box and can be loaded via --profile:
| Profile name | What it adds |
|---|---|
| db-aware-laravel | Eloquent / Laravel methods (firstWhere, pluck, chunk, increment, raw* helpers). |
| db-aware-doctrine | Doctrine ORM / DBAL (createQuery, executeStatement, fetchAssociative*, transaction helpers). |
| db-aware-cake | CakePHP ORM (patchEntity, saveOrFail, findThreaded, loadInto). |
| db-aware-codeigniter | CodeIgniter 4 DB Query Builder (get, insert, update, delete, countAll, escape). |
| db-aware-thinkorm | ThinkPHP 6.x / think-orm (find, select, insert, update, delete, count, aggregation). |
| db-aware-yii | Yii 2 ActiveRecord / DB (find, findOne, findAll, save, insert, update, delete, createCommand). |
| db-aware-medoo | Medoo (select, insert, update, delete, create, drop, query, exec). |
| db-aware-propel | Propel ORM (doSelect, doInsert, doUpdate, doDelete, find, save). |
| db-aware-redbean | RedBeanPHP (find, dispense, store, trash, save, load, wipe, related). |
| db-aware-cycle | Cycle ORM (find, findAll, persist, delete, select, where, aggregate). |
| db-aware-phpactiverecord | PHP ActiveRecord (find, all, first, last, create, save, update, delete, destroy). |
| db-aware-myadmin | MyAdmin custom db_abstraction (MyDb\Mysqli\Db, MyDb\Pdo\Db — query, qr, next_record, prepare, execute). |
| db-aware-myadmin-orm | MyAdmin custom ORM (MyAdmin\Orm\* extending Base\Orm — find, load, save, update, insert, delete, truncate). |
| db-aware-illuminate | Illuminate Database standalone (DB::table(), DB::select(), DB::insert(), DB::update(), DB::delete(), DB::raw()). |
| db-aware-aura | Aura.Sql (fetchAll, fetchOne, query, execute, quote, begin/commit/rollback). |
| db-aware-atlas | Atlas.PDO (fetchAll, fetchAssoc, fetchSelect, begin/commit/rollback). |
| db-aware-easydb | EasyDB (query, fetchAll, fetchOne, iterator, safeQuery, build). |
| db-aware-dibi | Dibi (query, fetchAll, fetch, test, begin/commit/rollback). |
| db-aware-pixie | Pixie Query Builder (table, get, insert, update, delete, where, orderBy). |
| db-aware-redis | Redis (Predis, Credis, phpredis) — get, set, mget, mset, hget, hset, del, expire, etc. |
| db-aware-mongodb | MongoDB driver (find, findOne, insertOne, insertMany, updateOne, deleteOne, aggregate, count, etc.). |
| db-aware-elasticsearch | Elasticsearch PHP client (search, index, get, mget, bulk, delete, update, count, scroll, msearch). |
| db-aware-neo4j | Neo4j PHP client (run, match, create, merge, set, delete, detachDelete). |
| db-aware-influxdb | InfluxDB client (query, write, ping, bucket, organization, flux). |
| db-aware-couchdb | Doctrine CouchDB ODM (find, findBy, persist, save, remove, delete, refresh). |
| db-aware-couchbase | Couchbase SDK (get, upsert, insert, replace, remove, lookupIn, mutateIn, view, query). |
| db-aware-idiorm | Idiorm/Paris (find, findOne, where, orderBy, count, create, save, update, delete). |
| db-aware-laminas | Laminas\Db (formerly Zend\Db) adapter (select, fetchAll, fetchOne, query, insert, update, delete). |
| db-aware-phalcon | Phalcon ORM (find, findFirst, save, create, update, delete, refresh, query, aggregate, sum, count, average, min, max). |
Compose them with --db-aware for the canonicalisation pass plus
the stock DB call coverage. The user's own db_symbols in
phpdup.json win over profile-provided symbols, which in turn
win over the stock registry.
IR — intermediate representation lift
Phpdup\Ir\ is the foundational scaffold for option 5 of the
plan: a canonical, language-/library-agnostic representation that
sits between the PHP AST and the similarity scorer.
Block (PHP AST) → IR (canonical operation graph) → fingerprint / score
The IR replaces PHP-specific syntax (the difference between $x->y
and $x['y'], between for and foreach, between if/else and
match) with a small set of operation-shaped nodes:
| IR node | Lifts from |
|---|---|
| DbReadIr | Model::find, $em->find(...), $pdo->query("SELECT …"), pg_* reads. |
| DbWriteIr | $x->save, $em->flush, $pdo->query("UPDATE …"), pg_insert/update. |
| DbDeleteIr | Model::destroy, $em->remove, DELETE FROM …, TRUNCATE …. |
| DbQueryIr | Generic query/exec calls with no clean read/write classification. |
| DbExecuteIr | prepare/execute-style two-phase calls. |
| AssignIr | Local assignments — LHS shape collapsed to var/prop/index/… |
| BranchIr | if/else, match, switch (unfolded into nested branches), ternaries. |
| LoopIr | for, foreach, while, do-while — keyword distinction erased. |
| CallIr | Any unrecognised method/function/static call. |
| ReturnIr | return [expr]; and bare return;. |
| VarIr | Variable references (name collapsed to __V). |
| LiteralIr | Scalar literals (only the type — str/int/float/bool/null — survives). |
| BlockIr | Statement sequences (function/method bodies, branch arms, loop bodies). |
Phpdup\Ir\IrLifter walks a PhpParser AST and produces an IR tree.
Lifting is partial by design: unrecognised input shapes (e.g.
eval, goto, complex variable-variables) fall through to a
generic CallIr carrying the node class name; if the lift fails
outright the lifter returns null and callers fall back to
AST-level scoring (per the plan's risk-mitigation note).
Phpdup\Ir\IrPrinter produces a deterministic token stream and a
human-readable pretty-print. Phpdup\Ir\IrSimilarity scores two
IR trees: 1.0 for identical printed-token streams (a SHA-1-hash
fast path is exposed via IrSimilarity::hash()), then a
multiset-Jaccard fallback for partial overlap.
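The two-step scoring (hash fast path, then bag fallback) can be sketched over flat token streams. Illustrative only; the real IrSimilarity consumes IR trees via IrPrinter:

```php
<?php
// Score two IR token streams: identical streams short-circuit to 1.0 via
// a SHA-1 compare of the joined stream; otherwise fall back to a
// multiset Jaccard over the token bags.
function irScore(array $tokensA, array $tokensB): float
{
    if (sha1(implode(' ', $tokensA)) === sha1(implode(' ', $tokensB))) {
        return 1.0;
    }
    $bagA = array_count_values($tokensA);
    $bagB = array_count_values($tokensB);
    $inter = 0;
    $union = 0;
    foreach (array_unique(array_merge($tokensA, $tokensB)) as $t) {
        $inter += min($bagA[$t] ?? 0, $bagB[$t] ?? 0);
        $union += max($bagA[$t] ?? 0, $bagB[$t] ?? 0);
    }
    return $union > 0 ? $inter / $union : 1.0;
}
```

Note the bag fallback deliberately ignores token order, which is what lets a foreach accumulator and an array_reduce pipeline with the same IR vocabulary score as partially similar.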
The IR scorer is wired into the Clusterer as a fifth tier behind
--scorer=ir. When AST Jaccard / TED / containment all reject a
pair, the clusterer falls back to multiset Jaccard over the
pre-computed IR token bags (Block::$irBag, populated by
PreprocessWorker when scorer=ir); pairs at or above
--ir-threshold (default 0.85) form edges weighted by the IR
similarity. Lift failure on either block leaves irBag null and
the IR tier silently skips that pair, preserving the plan's
risk-mitigation note ("fall back to AST scoring on lift failure").
ML-learned pair similarity (sidecar)
phpdup ships a sidecar contract for option 6 of the plan: an external ML pair-scoring service trained on a labelled corpus that includes ORM ↔ raw-SQL examples. The PHP side stays light:
- Phpdup\Ml\PairFeatures extracts an 11-field feature vector from a
(blockA, blockB) pair using the existing dataflow summary, DB op
tags, and IR token-stream — no ML libraries required.
- Phpdup\Ml\MlPairClient is the HTTP companion to the existing
MlClient (cluster-safety scoring). It POSTs the feature vector to
/score-pair on a configured sidecar and consumes a single
{similarity, confidence} response.
- MlPairClient::score() returns null on transport error, malformed
response, or unsafe URL — callers fall back to AST-level scoring
without code-path branching, mirroring the IR lifter's fail-graceful
pattern.
The wire contract is documented in docs/ML.md; the
labelled-corpus format the sidecar expects is documented in
docs/ml-corpus-format.md. A
feature_version field is embedded in every payload so the sidecar
can warn when the model was trained against a different feature
shape (the PairFeatures::FEATURE_VERSION constant bumps any time
the schema changes).
The model itself, the training pipeline, and the labelled corpus all live in a sister repository — phpdup ships only the contract and the feature extractor.
Once a sidecar is running, point phpdup at it via --ml-pair-url
(or ml_pair_url in phpdup.json):
bin/phpdup analyze src --ml-pair-url=https://ml.example.com/api \
--ml-pair-threshold=0.85
Phpdup\Clustering\Clusterer and Phpdup\Parallel\PairScoreWorker
both consult the configured Phpdup\Ml\PairScorer as the last
clustering tier — after structural-hash, AST Jaccard + TED,
containment, and IR have all rejected a pair. The client returns
null on transport failure (model unavailable, SSRF-rejected URL,
malformed response) so the rest of the pipeline keeps running with
reduced precision rather than failing the run, mirroring the IR
lifter's fail-graceful contract.
TUI mode
--tui opens a SugarCraft dashboard that drives the analysis pipeline
from inside the runtime — counts and timings tick up as work
progresses instead of appearing post-hoc.
What you get:
- Four-pane FlexBox dashboard with live counts for Scanning / Preprocessing / Clustering / Refactoring.
- SugarCraft spinner in the running pane (one of 12 animation styles) plus a real elapsed-time stopwatch.
- Sparkline of stage durations as each stage completes.
- OSC 9;4 taskbar progress — ConEmu, WezTerm, and Windows Terminal pin a progress indicator on the OS taskbar.
- Six themes: ansi (default), plain, charm, dracula, nord,
catppuccin — pick via --theme=NAME.
- Keyboard: q / Ctrl+C quit, Ctrl+Z suspend, ↑/↓ cycle pane focus,
Enter open detail, ←/→ cycle clusters in detail view, t toggle sort
(impact / similarity / name), h toggle help, Esc dismiss detail.
- --plain forces plain CLI output even when --tui would otherwise
fire (handy for CI and pipes).
The pipeline runs inside the runtime via a cooperative
Pipeline::iter() generator. Each next() advances to the next
yield point — pre-stage, post-stage, every 16 files in
ScanningStage, every 32 records streamed back from the parallel
preprocess pool. Between yields the runtime renders, so the
sparkline, file counts, parse-error totals, and taskbar progress
build up frame-by-frame.
Watch mode
--watch keeps phpdup running and re-analyzes on every file change.
A React\EventLoop periodic timer (default 1.5 s) polls
filemtime() for every scanned file, with clearstatcache() before
each read to defeat PHP's stat cache. When any mtime changes, phpdup
re-runs the full pipeline and reports change detected (N files) — reload #X.
Polling instead of inotify / FSEvents keeps the watcher
dependency-free and portable to macOS / Linux without an extension —
at the cost of a small (≤ 1.5 s) reload latency.
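The change-detection step reduces to a snapshot/compare over filemtime(). A standalone sketch (the real watcher runs this on a React\EventLoop periodic timer):

```php
<?php
// Compare a stored mtime snapshot against the filesystem and return the
// paths that changed since the last poll. clearstatcache() defeats PHP's
// per-path stat cache; the snapshot is updated in place.
function detectChanges(array &$snapshot): array
{
    $changed = [];
    foreach ($snapshot as $path => $mtime) {
        clearstatcache(true, $path);
        $now = @filemtime($path);
        if ($now !== false && $now !== $mtime) {
            $changed[] = $path;
            $snapshot[$path] = $now;
        }
    }
    return $changed;
}
```

In the real watcher this runs roughly every 1.5 s; a non-empty return is what triggers the full pipeline re-run.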
Ctrl+C (or SIGTERM) triggers a clean teardown via
Loop::addSignal. --watch --tui is supported — the watcher's
periodic timer registers on the same React\EventLoop instance the
SugarCraft Program runs on, so the dashboard stays interactive
while polling. On change, the watcher dispatches a
RestartPipelineMsg to the program; the model resets state, rebuilds
the cooperative generator via its factory, and the live counts in the
panes drop to zero before climbing back up.
SIGINT soft-cancel
Pressing Ctrl+C mid-analysis (outside --watch) doesn't drop the
work in progress. Phpdup\Cli\SignalHandler registers an async
SIGINT handler that flips PipelineState::$cancelled; the
cooperative pipeline checks this between stages and short-circuits
straight to the Reporting stage. The user gets a partial report
covering whatever clusters had already formed, plus a non-zero
exit code (130, the canonical SIGINT code) so CI workflows can
distinguish "user cancelled" from "clean run".
A second Ctrl+C restores the default SIGINT handler so a truly
stuck process can still be killed with the third-time-is-default
escape hatch.
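The between-stage check is the essential mechanism. A minimal model of the cooperative loop; the class and callable shapes here are illustrative, not the real Phpdup\Cli\SignalHandler wiring:

```php
<?php
// A cancelled flag flipped by the SIGINT handler; the stage loop checks
// it between stages and short-circuits straight to reporting.
final class PipelineState
{
    public bool $cancelled = false;
}

function runStages(PipelineState $state, array $stages, callable $report): array
{
    $done = [];
    foreach ($stages as $name => $stage) {
        if ($state->cancelled) {
            break; // skip remaining stages, fall through to reporting
        }
        $stage();
        $done[] = $name;
    }
    $report($done); // partial report when cancelled mid-run
    return $done;
}
```

In the real binary the flag would be flipped asynchronously (e.g. via pcntl_async_signals(true) plus a pcntl_signal(SIGINT, ...) handler, an assumption about the mechanism); the loop shape is the same either way.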
phpdup serve REST API
phpdup serve boots a minimal HTTP service on top of the analysis
pipeline. Useful for in-house dashboards, CI integrations, and a
hosted-backend playground front-end.
phpdup serve --host 127.0.0.1 --port 8080
Routes:
| Method | Path | Body | Returns |
|---|---|---|---|
| GET | /healthz | — | text/plain "ok" |
| POST | /analyze | JSON: {"paths":["src","lib"]} | full JSON report |
| POST | /jobs | JSON: same as /analyze | 202 {"job_id":"…"} |
| GET | /jobs/{id} | — | job status + result |
Implementation is intentionally dependency-light:
stream_socket_server plus a hand-rolled HTTP/1.1 parser. No
ReactPHP / Amp dep — for high-traffic deployments, swap in
Roadrunner / FrankenPHP / FPM behind nginx and reuse
Phpdup\Server\Application directly.
Security defaults:
- Default bind is 127.0.0.1 — exposing it externally requires a
reverse proxy.
- Content-Length is capped at 16 MB.
- JSON parsing uses JSON_THROW_ON_ERROR and returns 400 on malformed
bodies.
- The Application class is fully transport-agnostic and unit-tested
without networking.
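A quick smoke test against a local instance might look like this (routes from the table above; the jq usage is illustrative):

```shell
# Health check
curl -s http://127.0.0.1:8080/healthz
# → ok

# Synchronous analysis of two directories
curl -s -X POST http://127.0.0.1:8080/analyze \
     -H 'Content-Type: application/json' \
     -d '{"paths":["src","lib"]}' | jq '.summary'

# Fire-and-forget job, then poll its status
JOB=$(curl -s -X POST http://127.0.0.1:8080/jobs \
     -d '{"paths":["src"]}' | jq -r '.job_id')
curl -s "http://127.0.0.1:8080/jobs/$JOB"
```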
See docs/SERVER.md for the full contract,
deployment notes, and the playground architecture.
Output formats
bin/phpdup analyze src \
--json phpdup.json \
--html phpdup-report \
--sarif phpdup.sarif \
--gitlab-sast phpdup.gitlab.json \
--diff ./phpdup-diffs \
--patch phpdup.patch \
--checkstyle phpdup.xml \
--csv phpdup.csv \
--prometheus phpdup.prom \
--timeseries phpdup-history.jsonl \
--graphviz phpdup.dot \
--plantuml phpdup.puml \
--refactor-patch ./phpdup-refactors \
--refactor-tests ./phpdup-refactor-tests
phpdup ships twelve output formats; the additions over the classic five are:
- CSV (--csv=FILE) — flat one-row-per-cluster-member; ideal for
spreadsheets / BI ingest. Multi-line signatures are collapsed to one
line so consumers don't have to handle embedded newlines.
- Prometheus text-format (--prometheus=FILE) — # HELP / # TYPE
annotated gauges suitable for pushgateway scraping or CI dashboards.
Per-pattern-tag counters included.
- Time-series JSONL (--timeseries=FILE) — one append-only line per
run, commit-tagged via GIT_COMMIT / GITHUB_SHA / CI_COMMIT_SHA /
BUILD_VCS_NUMBER (with .git/HEAD fallback) so duplicate-debt curves
over time are trivial to track.
- Graphviz DOT (--graphviz=FILE) — file→cluster bipartite graph;
render with dot -Tpng phpdup.dot -o phpdup.png.
- PlantUML (--plantuml=FILE) — class-diagram-style with each cluster
as a package and pattern-tag stereotypes.
- Refactor patches (--refactor-patch=DIR) — heuristic,
manual-review-required .patch files. Each adds a
Refactored/<id>.php skeleton with the suggested abstraction
signature plus per-member edit hints. Bails to a manual-review
header when $this / self:: / yield / closure capture would make
mechanical replacement unsafe.
- PHPUnit test skeletons (--refactor-tests=DIR) — one
markTestIncomplete() test class per cluster with a casesProvider()
populated from observed hole values.
CLI
SugarCraft-styled colorized terminal output. Honors --no-ansi /
non-TTY: switches to Theme::plain() and skips the styled box / chips,
producing clean ASCII.
JSON (--json=FILE)
{
"phpdup_version": "0.1.0",
"summary": { "files": 1888, "blocks": 12340, "clusters": 87 },
"clusters": [
{
"id": "Xaeb0e34a",
"kind": "method",
"exact": true,
"similarity": 1.0,
"confidence": 1.0,
"impact": 74,
"pattern_tags": ["config-driven", "crud-handler", "sql-builder"],
"signature": "function findById(string $value): mixed",
"members": [ ... ],
"holes": [
{
"placeholder": "__P0",
"kind": "literal",
"inferred_type": "string",
"suggested_name": "$value",
"observed": [
"'SELECT * FROM users WHERE id = ?'",
"'SELECT * FROM products WHERE id = ?'"
]
},
{
"placeholder": "__O0",
"kind": "optional_block",
"inferred_type": "bool",
"suggested_name": "$includeAudit",
"observed": ["audit($id);", "<absent>"],
"present_in_members": [0]
}
]
}
]
}
holes[].present_in_members is unique to optional_block holes — it
lists the cluster-member indices that did include the segment.
HTML (--html=DIR)
A static-site report with:
- Index page sorted by impact, with an interactive mini-map of cluster-impact distribution (green bars exact, blue bars near-duplicate; click to jump).
- Client-side column sort (click any header) and a search filter input.
- Copy-suggested-signature button per cluster (Clipboard API + an
execCommand fallback for non-secure contexts).
- Per-cluster page with member sources side-by-side, inline syntax
highlighting (no external dep, no build step), holes table, and a
unified diff between the first two members.
- Optional rows tinted amber, with a "type-3" badge and an italicized
<absent> sentinel.
JS lives in app.js next to style.css — both inlined into the
build, no external deps, no build step.
SARIF (--sarif=FILE)
SARIF 2.1.0
output for GitHub Code Scanning and GitLab Code Quality. Each
duplicate block becomes a result with rule
phpdup/duplicate-logic, level warning for exact / note for
near-duplicate, the cluster's suggested signature in
properties.suggestedSignature, and shared
partialFingerprints.clusterId so SARIF consumers can group sibling
results into a single cluster annotation. Type-3 clusters add
properties.optionalSegmentCount and properties.hasOptionalSegments.
In a GitHub Actions workflow:
- run: vendor/bin/phpdup analyze src --sarif=phpdup.sarif --min-impact=50
- uses: github/codeql-action/upload-sarif@v3
  if: always()
  with:
    sarif_file: phpdup.sarif
GitLab SAST (--gitlab-sast=FILE)
GitLab SAST report v15.x.
Severity is impact-bucketed (>100 High, >=50 Medium, >=20 Low,
otherwise Info). Confidence is High for exact clones, otherwise
derived from the cluster confidence score. Wire it into
.gitlab-ci.yml as a report.sast artifact so the MR security
widget surfaces duplicates.
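A minimal wiring might look like this; the job name and stage are illustrative, and artifacts:reports:sast is the key GitLab's MR widget reads:

```yaml
phpdup-duplicates:
  stage: test
  script:
    - vendor/bin/phpdup analyze src --gitlab-sast=phpdup.gitlab.json
  artifacts:
    reports:
      sast: phpdup.gitlab.json
```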
Diff and patch (--diff=DIR, --patch=FILE)
--diff=DIR writes one .diff file per cluster — pairwise unified
diffs from member[0] to each subsequent member, with a header comment
showing the suggested abstraction and anchor location. --patch=FILE
concatenates every cluster's diff into one cumulative patch file.
Checkstyle XML (--checkstyle=FILE)
Consumable by Jenkins Warnings NG, Bitbucket Reports, Sonar, Detekt
— anything that parses Checkstyle. Each duplicate is an <error>
with source="phpdup.duplicate-logic", severity warning for exact
/ info for near-duplicate, and a descriptive message linking back
to the cluster id.
Configuration
Drop a phpdup.json next to your code, or pass --config:
{
"paths": ["src", "app", "lib"],
"exclude": ["vendor/**", "node_modules/**", "**/*.tpl.php", "tests/**"],
"min_block_size": 8,
"max_block_size": 800,
"normalization_mode": "aggressive",
"similarity_threshold": 0.80,
"tree_threshold": 0.85,
"min_cluster_impact": 20,
"max_df": 0.01,
"ngram_size": 5,
"cache_dir": ".phpdup-cache",
"workers": 0,
"incremental": true,
"lazy_ast": true,
"kinds": ["method", "closure"],
"optional_blocks": {
"enabled": true,
"containment": 0.85,
"min_overlap": 0.6,
"max_per_cluster": 3,
"min_segment_length": 1
},
"report": {
"html": "phpdup-report",
"json": "phpdup.json"
}
}
The full schema lives at
docs/config-schema.json and is validated
by ConfigLoader::validate() whenever a config file is loaded.
CLI flags override config values. Run bin/phpdup analyze --help for
the full list, or see CLI reference below for
per-flag details and effect snippets.
bin/phpdup analyze --config phpdup.json --validate-config runs
validation in isolation and exits with 0 (OK) or 2 (with the
field-path error message), so CI can gate on config drift before any
analysis runs.
CLI reference
Usage: phpdup analyze <paths...> [options]
Tuning
--min-block-size N (default: 8)
Minimum AST node count for a block to be considered. Lower picks up small fragments (often noisy); higher quiets the report by ignoring short blocks.
bin/phpdup analyze src --min-block-size=4    # very chatty — short snippets clustered
bin/phpdup analyze src --min-block-size=20   # only meaningful methods
--mode strict|default|aggressive (default: aggressive)
Normalization mode. See Normalization modes.
bin/phpdup analyze src --mode=strict        # variable rename only — fewer false positives
bin/phpdup analyze src --mode=aggressive    # also collapse names + literals — more findings
--similarity N (default: 0.80)
Jaccard similarity threshold for the near-duplicate phase. Below this, the type-3 fallback may still apply.
bin/phpdup analyze src --similarity=0.95    # strict — very obvious clones only
bin/phpdup analyze src --similarity=0.65    # loose — many type-2 clones surfaced
--max-df N (default: 0.01)
Maximum document-frequency for an n-gram to be used as a candidate-pair seed. Tuned for large codebases; bump for tiny fixtures where every n-gram is "common" by ratio.
bin/phpdup analyze tests/Fixtures --max-df=0.5    # tiny corpus needs higher cutoff
bin/phpdup analyze big-codebase --max-df=0.001    # very strict on a 50k-block corpus
--min-impact N (default: 20)
Minimum cluster impact (≈ duplicated-line count) to include in output. Quiets the report; doesn't change clustering.
bin/phpdup analyze src --min-impact=50   # only show big wins
bin/phpdup analyze src --min-impact=1    # show everything that clustered
--exact-only
Skip the near-duplicate phase entirely. ~6× faster on large corpora; emits only Type-1 (canonical-hash-equal) clusters.
bin/phpdup analyze src --exact-only --min-impact=50
--kinds K1,K2,...
Comma-separated block kinds to extract. Default = all of:
function|method|closure|arrow|if|for|foreach|while|do|try|switch|match.
bin/phpdup analyze src --kinds=method              # methods only
bin/phpdup analyze src --kinds=method,closure      # methods + closures
bin/phpdup analyze src --kinds=if,foreach,switch   # control structures only
--max-memory MB
Soft memory ceiling. When peak RSS exceeds this mid-pipeline,
phpdup logs phpdup: peak RSS X MB exceeded --max-memory=Y and
suggests --exact-only. 0 (default) disables.
bin/phpdup analyze huge-monorepo --max-memory=1024 # warn if RSS > 1 GB
--optional-blocks on|off (default: on)
Type-3 / "optional-segment" detection master switch. See Type-3 / optional-segment detection.
bin/phpdup analyze src --optional-blocks=off # legacy Jaccard-only behaviour
--optional-blocks-containment N (default: 0.85)
Containment-fallback threshold for the type-3 path.
bin/phpdup analyze src --optional-blocks-containment=0.95   # very strict
bin/phpdup analyze src --optional-blocks-containment=0.7    # permissive
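The containment fallback exists because a short block fully embedded in a longer one can score poorly on Jaccard. A sketch using the textbook containment definition — illustrative Python; the actual ContainmentSimilarity metric may differ in details such as the size-ratio guard:

```python
def containment(a: set, b: set) -> float:
    """Containment: how much of the smaller n-gram set lies inside the
    other — |A ∩ B| / min(|A|, |B|). A short block fully embedded in a
    longer one scores 1.0 even when Jaccard is well below threshold."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

shared = set(range(10))                       # the common statements
longer = shared | {"s10", "s11", "s12", "s13", "s14"}  # 5 extra statements
print(containment(shared, longer))   # → 1.0 (Jaccard would be 10/15 ≈ 0.67)
```

This is why the type-3 path can link a 10-statement body to a 15-statement body that a 0.80 Jaccard threshold rejects.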
Output
--html DIR
Write the interactive HTML report into DIR.
bin/phpdup analyze src --html=phpdup-report
# open phpdup-report/index.html in a browser
--json FILE
Structured JSON dump.
bin/phpdup analyze src --json=phpdup.json
jq '.clusters | length' phpdup.json
--sarif FILE
SARIF 2.1.0 output for GitHub PR annotations.
bin/phpdup analyze src --sarif=phpdup.sarif
gh codeql-action upload-sarif phpdup.sarif # in CI
--gitlab-sast FILE
GitLab SAST report v15.x for the MR security widget.
bin/phpdup analyze src --gitlab-sast=gl-sast-report.json
--diff DIR
One .diff file per cluster (pairwise unified diffs from member[0]).
bin/phpdup analyze src --diff=./phpdup-diffs
ls phpdup-diffs/*.diff | wc -l
--patch FILE
Single cumulative patch file containing every cluster diff.
bin/phpdup analyze src --patch=phpdup.patch
--checkstyle FILE
Checkstyle XML for Jenkins / Sonar / Bitbucket consumers.
bin/phpdup analyze src --checkstyle=phpdup.xml
--limit N (default: 50)
Maximum number of clusters to print to the terminal. Doesn't affect file outputs.
bin/phpdup analyze src --limit=10     # top 10 in the CLI
bin/phpdup analyze src --limit=1000   # show everything
--sort KEY[:asc|desc] (default: impact:desc)
Order clusters by a chosen attribute and direction. Affects every
output format (CLI, JSON, HTML, SARIF, GitLab SAST, Diff, Checkstyle)
because it runs inside the Ranker after impact + confidence are
computed.
| Key | Sorts clusters by … |
|---|---|
| impact | Estimated lines saved if the abstraction is applied (the default). |
| members | Number of duplicate blocks in the cluster (alias: size, count). |
| block-size | Average AST node count per member (proxy for "how big each duplicate is"). |
| lines | Total duplicated lines across all members. |
| similarity | Edge weight (1.0 for exact, lower for type-2/3 near-dupes). |
| confidence | The Ranker's [0..1] "how safe is this refactor" score. |
| name | First member's qualified name (Namespace\Class::method), alphabetical. |
| file | First member's file path, alphabetical. |
| id | Cluster id, alphabetical (mainly useful for stable diffs across runs). |
Direction defaults to desc. Use :asc / :desc, or the shortcut
prefixes - (desc) and + (asc):
bin/phpdup analyze src --sort=members:desc      # most-duplicated clusters first
bin/phpdup analyze src --sort=block-size:desc   # biggest blocks first
bin/phpdup analyze src --sort=lines             # most duplicated lines (desc default)
bin/phpdup analyze src --sort=similarity:asc    # weakest matches first — review marginal type-3
bin/phpdup analyze src --sort=confidence:desc   # safest refactors first
bin/phpdup analyze src --sort=name:asc          # alphabetical
bin/phpdup analyze src --sort=-impact           # leading - = desc shortcut
bin/phpdup analyze src --sort=+lines            # leading + = asc shortcut
Ties are broken consistently (members DESC ▸ similarity DESC ▸ id ASC) so the same input always produces the same final ordering regardless of the user's primary direction.
In the TUI, the t key cycles through the same sort keys live (impact
→ members → block-size → lines → similarity → confidence → name → wrap),
and shift-T toggles asc/desc. The current sort is shown in the Tip line.
--stats
Print pipeline stage timings, block-kind histogram, and worker info after the report.
bin/phpdup analyze src --stats
Runtime
-c, --config FILE
Load settings from a phpdup.json file. CLI flags override file
values.
bin/phpdup analyze --config=phpdup.json src
-j, --workers N (default: 0 = auto)
Worker count for parallel preprocess + pair scoring. 0 autodetects
from nproc / /proc/cpuinfo. 1 forces serial.
bin/phpdup analyze src --workers=8   # explicit
bin/phpdup analyze src -j 1          # serial — debugging
--no-cache
Don't read or write the AST cache for this run.
bin/phpdup analyze src --no-cache # benchmarking, or after upgrading php-parser
--no-incremental
Disable per-file index reuse. Forces re-fingerprinting every file even when the IndexStore has a hit.
bin/phpdup analyze src --no-incremental # benchmarking, or paranoid about cache
--no-lazy-ast
Keep all original ASTs in memory throughout the run. Higher RSS, slightly faster anti-unification.
bin/phpdup analyze src --no-lazy-ast # have RAM, want speed
--stage NAME
Halt the pipeline after STAGE (one of scanning, preprocessing,
clustering, refactoring, reporting). Useful for debugging
incremental cache hits or profiling individual stages.
bin/phpdup analyze src --stage=preprocessing --stats   # measure how long parsing takes
bin/phpdup analyze src --stage=clustering              # see clusters before refactor synthesis
TUI / watch
--tui
Show the interactive SugarCraft dashboard while analysis runs. Requires a real TTY.
bin/phpdup analyze src --tui
--theme NAME (default: ansi)
TUI theme: ansi | plain | charm | dracula | nord |
catppuccin.
bin/phpdup analyze src --tui --theme=dracula
bin/phpdup analyze src --tui --theme=catppuccin
--plain
Force plain CLI output (no TUI, no ANSI colours). Useful when a CI
shell reports isatty()=true but you want non-coloured output.
bin/phpdup analyze src --plain
--watch
Stay running and re-analyze on file changes via a poll-based
React\EventLoop timer. Combines with --tui for a live dashboard.
bin/phpdup analyze src --watch         # plain mode, prints reload messages
bin/phpdup analyze src --watch --tui   # interactive dashboard, resets on change
Validation
--validate-config
Validate the --config file against the documented schema and exit
without running analysis. Exit 0 = OK, 2 = error (with field
path).
bin/phpdup analyze --config=phpdup.json --validate-config && echo OK
Shell completion
phpdup ships its own completion sub-command. The output is the standard
Symfony Console completion script for the chosen shell, with commented-out
installation instructions prepended so you can cat the dump and follow the
steps inline:
bin/phpdup completion bash   # → bash script + comments showing 3 install paths
bin/phpdup completion fish   # → fish script + comments
bin/phpdup completion zsh    # → zsh script + comments (#compdef stays first)
The most common one-liner per shell:
# bash — per-user completion under XDG home
mkdir -p ~/.local/share/bash-completion/completions
phpdup completion bash > ~/.local/share/bash-completion/completions/phpdup

# fish — auto-loaded on next shell start
mkdir -p ~/.config/fish/completions
phpdup completion fish > ~/.config/fish/completions/phpdup.fish

# zsh — pick a directory on $fpath BEFORE compinit
mkdir -p ~/.zsh/completions
phpdup completion zsh > ~/.zsh/completions/_phpdup
# then in ~/.zshrc, before 'compinit': fpath=(~/.zsh/completions $fpath)
When shell is omitted, $SHELL is consulted. Unknown shells exit 2.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Analysis ran. Note: phpdup does NOT exit non-zero when clusters are found. Use the JSON report to gate CI; an empty clusters array means clean. |
| 1 | Internal error. |
| 2 | Missing required argument or invalid configuration. |
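Because exit code 0 doesn't distinguish clean from duplicated, a CI gate has to inspect the report itself. A minimal sketch in Python, assuming only the top-level clusters array shown in the earlier jq example (`gate` and `max_clusters` are invented names):

```python
import json

def gate(report: dict, max_clusters: int = 0) -> int:
    """CI-friendly exit code from a parsed phpdup JSON report:
    0 when the clusters array is within budget, 1 otherwise."""
    found = len(report.get("clusters", []))
    if found > max_clusters:
        print(f"phpdup: {found} duplicate clusters (budget {max_clusters})")
        return 1
    return 0

# Typical wiring: sys.exit(gate(json.load(open("phpdup.json"))))
print(gate(json.loads('{"clusters": []}')))   # → 0
```

The same check is a one-liner with jq (`jq -e '.clusters | length == 0'`), but a small script lets you set a non-zero budget.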
Environment variables
| Variable | Effect |
|---|---|
| PHPDUP_WORKERS | Override worker count (lower precedence than -j). |
| COLUMNS | Override terminal width detection for the CLI report. |
Programmatic use
The pipeline is fully composable from PHP. The simplest entrypoint is
Pipeline itself; everything else is plugged into it via stage
constructors:
use Phpdup\Cli\Config;
use Phpdup\Pipeline\Pipeline;
use Phpdup\Pipeline\PipelineState;
use Phpdup\Pipeline\Stages\ScanningStage;
use Phpdup\Pipeline\Stages\PreprocessStage;
use Phpdup\Pipeline\Stages\ClusterStage;
use Phpdup\Pipeline\Stages\RefactorStage;
use Phpdup\Pipeline\Stages\ReportStage;
use Symfony\Component\Console\Output\NullOutput;

$config = new Config(
    paths: ['src'],
    exclude: ['vendor/**'],
    optionalBlocksEnabled: true,
);
$state = new PipelineState($config);

(new Pipeline([
    new ScanningStage(),
    new PreprocessStage(useCache: true),
    new ClusterStage(exactOnly: false),
    new RefactorStage(useCache: true),
    new ReportStage(limit: 50, showStats: false),
]))->run($state, new NullOutput());

foreach ($state->clusters as $c) {
    echo "{$c->size()} members, signature: {$c->signature}\n";
}
For finer-grained control — e.g. to run preprocessing without
clustering, or to drive the pipeline cooperatively from your own
event loop — use Pipeline::iter() and pump the generator yourself.
That's exactly how the TUI works.
Examples
Threshold-gated notification (type-2)
Input:
public function notifyHigh($user, int $score): void
{
    if ($score > 10) {
        $this->mailer->send('admin', $user);
    }
}

public function notifyMid($user, int $score): void
{
    if ($score > 20) {
        $this->mailer->send('moderator', $user);
    }
}
Output:
function notifyByThresholdAndStrategy(
int $threshold,
string $value,
): mixed
Holes:
| Param | Type | Observed |
|---|---|---|
| $threshold | int | 10, 20 |
| $value | string | 'admin', 'moderator' |
Patterns: config-driven.
Repository CRUD (type-1 / type-2)
Three classes with findById($db, $id) differing only in table name.
function findById(string $value): mixed
| Param | Type | Observed |
|---|---|---|
| $value | string | 'SELECT * FROM users WHERE id = ?', 'SELECT * FROM products WHERE id = ?', 'SELECT * FROM orders WHERE id = ?' |
Patterns: config-driven, crud-handler, sql-builder.
Optional segments (type-3)
The motivating case — see
Type-3 / optional-segment detection
for the algorithm. Two if bodies share the same first three
statements; the longer one has two extra calls at the tail.
function extractedFunction(
bool $includeSomeOtherLogic = false,
bool $includeAndMore = false,
): mixed
| Param | Type | Kind | Observed |
|---|---|---|---|
| $includeSomeOtherLogic | bool | optional_block | some_other_logic($here);, <absent> |
| $includeAndMore | bool | optional_block | and_more($f);, <absent> |
Patterns: optional-segments.
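The alignment behind this example can be sketched with an off-the-shelf sequence matcher: statements outside the common subsequence become the optional segments. Illustrative Python only — the real anti-unifier uses its own statement-array LCS and forms one segment per divergent statement, whereas this sketch keeps contiguous runs together:

```python
from difflib import SequenceMatcher

def optional_segments(a: list, b: list) -> list:
    """Align two statement lists and return the runs present in only
    one member — the material that becomes bool $includeX parameters."""
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    extras = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "delete":
            extras.append(a[i1:i2])      # only in the shorter member
        elif tag == "insert":
            extras.append(b[j1:j2])      # only in the longer member
        elif tag == "replace":
            extras.append(a[i1:i2])
            extras.append(b[j1:j2])
    return extras

short = ["load($x);", "validate($x);", "persist($x);"]
long_ = ["load($x);", "validate($x);", "persist($x);",
         "some_other_logic($here);", "and_more($f);"]
print(optional_segments(short, long_))
# → [['some_other_logic($here);', 'and_more($f);']]
```

The two tail calls surface as extra statements; phpdup then guards each with its own boolean parameter, subject to the max_per_cluster cap.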
Strategy dispatch
A chain of if (...) calls each invoking a different validator, all
with the same shape. Cluster tagged strategy, single hole on the
call name, with the list of method names as observed values — a
clear hint to extract an interface and an array of strategies.
Static analysis & config validation
CI runs three layers of static checks on every push and PR:
- run: find src tests -name '*.php' -print0 | xargs -0 -n1 -P4 php -l > /dev/null
- run: vendor/bin/phpstan analyse --memory-limit=1G --no-progress
- run: vendor/bin/psalm --no-progress --no-cache
- PHPStan level 6 is clean (no baseline). Configuration in phpstan.neon.
- Psalm runs at error level 6 with a tracked baseline (psalm-baseline.xml) for legacy InvalidArrayOffset / MissingParamType findings in the APTED implementation.
- Config schema validation — ConfigLoader::validate() mirrors docs/config-schema.json and throws a RuntimeException whose message names the offending field on the first violation. --validate-config runs validation in isolation for CI gating.
Benchmarks
phpdup ships a real comparative benchmark harness — not synthetic microbenchmarks. The suite runs phpdup alongside the other PHP duplicate-detection tools that anyone might reach for, on a curated mix of public OSS corpora, and reports wall time, peak RSS, cluster counts, plus precision / recall / F1 against ground truth. It's designed to be honest — phpdup is not the best tool in every cell, and the matrix below says so out loud.
Comparative benchmark suite
bench/run-all.sh
That's the whole flow:
- Auto-download phpcpd.phar (phar-only; ~3 MB).
- npm install jscpd@4 into bench/tools/ (skipped if no node).
- Shallow-clone OSS corpora (Symfony Console, Laravel HTTP, PHPUnit, WordPress core).
- Generate the synthetic-fuzz corpus deterministically.
- Run every available tool against every corpus with a per-tool wall-time cap.
- Score each tool's output against the synthetic corpus's
.ground-truth.json.
Outputs:
- bench/results/latest.md — per (tool, corpus) wall time / RSS / cluster count.
- bench/results/detection-rate.md — per-tool precision / recall / F1 on the synthetic corpus (the only fair-fight scenario, since we know the right answer).
Tools probed (auto-skipped when missing; shown as — in the table):
| Tool | Source | Auto-installed |
|---|---|---|
| phpdup | bin/phpdup (this repo) | always |
| phpcpd | bench/tools/phpcpd.phar | yes |
| pmd-cpd | system pmd cpd | no — install separately |
| jscpd | bench/tools/node_modules/.bin/jscpd | yes (npm) |
| simian | system simian | no — commercial |
See bench/README.md for adding new tools or
corpora, and bench/feature-matrix.md for the capability comparison.
Feature matrix
A hand-curated 40+ row capability comparison vs phpcpd, pmd-cpd,
jscpd, and simian — full table in
bench/feature-matrix.md. The summary:
Where phpdup wins:
- Suggested-refactor output. No other tool produces a parameterised function signature, hole inventory, and unified-diff patches. If your goal is "I want to refactor the duplicates", phpdup is the only practical choice.
- Type-3 detection. --optional-blocks finds clones that differ in optional segments — phpcpd / pmd-cpd / jscpd require contiguous identical token runs and miss this entirely.
- Reporter coverage. phpdup ships 12 output formats including SARIF, GitLab SAST, Prometheus, Graphviz, and time-series JSONL. None of the others ship more than three.
- Pattern-tag classification. Tagging clusters as sql-builder, crud-handler, controller-action, etc. is unique and lets you filter / route findings by type.
- Live workflow. --watch + TUI is for the inner-loop developer who wants feedback as they refactor; nothing else covers this.
- Honesty. phpdup ships a synthetic-fuzz ground-truth scorer (bench/score.php) and publishes precision/recall — the others publish neither.
Where phpdup loses:
- Pure speed for exact clones. phpcpd's tokenizer is dead simple and beats phpdup's AST + APTED pipeline on raw wall time when all you want is "is anything copy-pasted verbatim?". Use phpdup --exact-only to close most of that gap.
- Cold-start RSS for tiny inputs. phpcpd peaks ~50 MB; phpdup peaks ~60–80 MB because it loads php-parser, Symfony Console, and the whole pipeline scaffolding.
- Mature IDE integration. PMD has a JetBrains plugin going back over a decade; phpdup has none yet (the contract for a future IntelliJ plugin lives at docs/JETBRAINS_PLUGIN.md). If your team already lives inside PhpStorm's PMD inspections, that's a real lock-in.
- Multi-language support. jscpd handles PHP, JS/TS, Python, Java, Ruby, Go, Rust, etc. from a single binary. phpdup is PHP-only by design.
- Maintenance perception. phpcpd is archived but it's the historical default — many CI pipelines still run it.
Quick chooser:
| If you need… | Reach for… |
|---|---|
| Refactor-actionable output, type-3 detection, pattern tags, any of the unique reporters | phpdup |
| A fast exact-clone gate in CI, no other features | phpcpd |
| JetBrains IDE inspection / polyglot Java + PHP project | pmd-cpd |
| Multi-language: PHP + JS + Python + … from one binary | jscpd |
| Commercial line-based diff finder | simian |
Internal scaling benchmark
phpdup's own parallel scaling on a real PHP corpus
(include/Api/: 530 files, 3,295 comparable blocks, 96 clusters):
| Configuration | Wall time | vs serial |
|---|---|---|
| serial (--workers 1), cold cache | 61.13 s | 1.00× |
| 4 workers, cold cache | 30.39 s | 2.01× |
| 8 workers, cold cache | 21.11 s | 2.90× |
| 16 workers, cold cache | 17.47 s | 3.50× |
| 8 workers, --exact-only | 5.74 s | 10.65× |
Cluster output is byte-identical across configurations — the speedups don't come from skipping work. APTED does correct Zhang-Shasha work (slower per pair than a bounded heuristic would be); the user-visible win comes from parallelism stacking on top.
Reproduce on your own corpus:
rm -rf .phpdup-cache
/usr/bin/time -f "%e s wall, %M KB rss" \
bin/phpdup analyze /path/to/your/code \
--min-impact 100 --stats --workers 8 --no-cache
Wall-time stage breakdown (8 workers, cold cache)
| Stage | Time | Share | Implementation |
|---|---|---|---|
| Preprocess | 1.4 s | 7% | Parse + extract + normalize + fingerprint, parallel |
| Cluster | 14.3 s | 68% | APTED + parallel candidate-pair scoring |
| Refactor | 4.2 s | 20% | Anti-unify + synthesize + pattern-tag (parallel via RefactorWorker) |
| Reporting/IO | 1.2 s | 5% | |
| Total | 21.1 s | 100% | Peak RSS 464 MB |
Clustering still dominates, with the parallel TED workload itself being the largest single chunk.
Honest reporting on what doesn't pay off
- APTED alone is a serial regression. Correct, but slower than the older bounded top-down: 61 s vs 35 s on this corpus, single thread. The win has to come from parallelism stacking on top.
- Diminishing returns past 8 workers. Going from 8 to 16 workers saves only ~4 s; the serial sections impose an Amdahl's-law ceiling on this corpus.
- Lazy AST is currently no faster than full-memory mode at this scale — the ~2 s reload overhead roughly cancels the small RSS saving. The default (lazy_ast: true) is conservative; if you have a large host, set it to false in phpdup.json for a speed bump.
- Incremental warm-cache savings are modest when the cluster phase dominates total wall time. Incremental shines when the corpus grows by a few files per run.
- Persistent cluster cache wins big on no-change re-runs — it skips Cluster + Refactor entirely. But invalidates wholesale on any block change, so it's "warm" only when no source moved.
Tuning knobs that move the needle
- --workers N — parallelism level (0 = auto-detect).
- --exact-only — skip near-duplicate detection. 5.74 s wall on the corpus above (8 workers) — the fastest "is this clean?" gate.
- --similarity (default 0.80) — raising it prunes more pairs before TED.
- --min-block-size (default 8) — fewer blocks, fewer pairs.
- --max-df (default 0.01) — stricter rare-gram cutoff.
- --auto-tune — picks the size-appropriate combination of the above based on your corpus shape; explicit flags still win.
- --ted-weights=semantic — weights method calls heavier than literals; slower per pair, but better at clustering behavioural near-duplicates.
- --no-incremental — disable per-file index snapshots.
- --no-lazy-ast — keep all original ASTs in RAM.
For exploratory analysis on any codebase:
bin/phpdup analyze src --auto-tune --workers $(nproc)
CI gate (exact clones only, fastest):
bin/phpdup analyze src --exact-only --min-impact 30 --workers 8
Massive monorepo (low memory budget):
bin/phpdup analyze src \
  --workers $(nproc) \
  --min-block-size 20 \
  --similarity 0.88
# lazy_ast and incremental are on by default.
Architecture
See ARCHITECTURE.md for the full design document
including data structures, algorithm details, and the staged
implementation plan.
The project is organized as:
src/
Cli/ CLI entry point, ConfigLoader (with schema validation), Command
Pipeline/ Stage enum, PipelineState, ProgressListener,
CooperativeStageInterface, Pipeline orchestrator,
five Stage classes (Scanning/Preprocess/Cluster/Refactor/Report).
Scanning/ File walking and glob filtering
Parsing/ nikic/php-parser wrapper + AST cache
Extraction/ Block selection (with --kinds filter) + lazy AST loader
Normalization/ Three-pass canonicalization
Fingerprint/ Structural hash + n-gram bag
Index/ In-memory + inverted index
Persistence/ IndexStore (per-file block snapshots)
Similarity/ Jaccard, ContainmentSimilarity (type-3), APTED tree-edit-distance
Clustering/ Hash-bucket + union-find with type-3 containment fallback
Parallel/ WorkerPool (run + runStreaming) + Preprocess/PairScore workers
Refactor/ Anti-unification with statement-array LCS,
parameter/signature synth (incl. optional_block bools),
pattern recognition.
Reporting/ CLI / JSON / HTML / SARIF / GitLab SAST / Diff / Checkstyle
reporters + ranker
Tui/ PhpdupModel (SugarCraft Model + ProgressListener), ViewState,
Msg types (StagePumpedMsg, RestartPipelineMsg),
TuiRunner (theme resolution + Program boot, with
shared-loop support for --watch + --tui)
Watch/ WatchRunner (poll-based React\EventLoop watcher)
Util/ AST serializer, hash helpers, line range
Modules have small, documented surfaces. New normalization rules, similarity metrics, or pattern recognizers plug in without touching the rest of the pipeline.
Testing
composer test                                  # full suite
vendor/bin/phpunit --testsuite Unit
vendor/bin/phpunit --testsuite Integration
composer coverage:html                         # writes tests/phpunit/coverage-html/
The test suite (165+ tests, 469+ assertions) covers:
- Scanner glob semantics
- Normalizer canonicalization across all three modes
- N-gram fingerprint determinism + Jaccard floor on unrelated code
- ContainmentSimilarity: subset/overlap/empty cases + size-ratio guard
- APTED correctness on identical, renamed, unrelated trees, plus bounded short-circuit behavior
- WorkerPool serial path, parallel path (skipped without pcntl), empty input, CPU-count detection, and the streaming variant including child-exception propagation and array-vs-generator task returns
- IndexStore round-trip, file-change invalidation, config-key invalidation
- Pipeline orchestration: stage ordering, stopAfter halting at the right boundary, stageProgress reset between stages, cooperative iter() yielding pre/post-stage and mid-stage for cooperative stages
- ProgressListener wiring: per-file scan + preprocess events fire, null listener is the default and never branches stage behaviour
- BlockExtractor --kinds filter rejects unknown kinds and keeps the ones you ask for
- ConfigLoader schema validation: every field's bounds + the cross-field min_block_size <= max_block_size rule + the optional_blocks sub-object
- Anti-unifier hole discovery on the canonical examples + type-3 optional-block formation, max-per-cluster cap fallback, seed selection regardless of cluster.members order, observed-value remapping after the seed swap
- ParameterSynthesizer type inference (every branch) + optional_block name derivation (verb-from-segment, stop-word skip, snake-to-camel, no-identifier fallback)
- SignatureBuilder: required-then-optional ordering, default-false rendering, optional-only-cluster signatures, name fallback
- PatternRecognizer: optional-segments tag presence/absence, plus config-driven, strategy, crud-handler
- Reporters: SARIF, GitLab SAST, Diff (per-cluster + cumulative patch), Checkstyle XML, JSON (with present_in_members), HTML index/cluster pages — each tested against a real Pipeline run on fixture data including the type-3 optional fixture
- TUI: ViewState pane focus / sort cycling, PhpdupModel key bindings, View::progressBar taskbar updates, ←/→ cluster navigation, StagePumpedMsg pump / generator exhaustion, RestartPipelineMsg rebuild, theme resolution
- End-to-end on a fixture corpus with expected clusters
GitHub Actions runs the full suite on every push and PR across PHP 8.1, 8.2, 8.3, 8.4, then PHPStan level 6 and Psalm on PHP 8.3, then uploads Clover coverage to Codecov.
Performance
Asymptotic complexity
| Operation | Complexity | Notes |
|---|---|---|
| File scanning | O(F) | F = file count |
| Parsing | O(L) per file | L = lines; cached on subsequent runs; parallelized |
| Block extraction | O(N) | N = AST node count |
| Normalization | O(N) | parallelized |
| Hashing | O(N) per block | parallelized |
| Hash bucketing | O(B) | B = block count |
| Inverted-index candidate | O(B × g̅) | g̅ = avg n-grams per block, with rare-gram pre-filter |
| Pairwise Jaccard | candidate-bounded | only blocks sharing rare grams; parallelized |
| Tree edit distance | bounded by (1−τ) × max(\|a\|, \|b\|) | APTED-style Zhang-Shasha forest DP with heavy-path order; aborts early |
| Anti-unification | O(|cluster| × N) per cluster | currently serial |
| LCS for stmt arrays | O(n × m) per divergent array | bounded by optional_blocks_max_per_cluster |
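The inverted-index row is what avoids O(B²) pair enumeration: only blocks that co-occur in some posting list are ever compared. A toy Python sketch of candidate generation — illustrative only, and omitting the rare-gram pre-filter that keeps posting lists short:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(blocks: dict) -> set:
    """Invert the block → n-gram map into gram → block postings, then
    emit only pairs that share at least one gram. Work is bounded by
    posting-list sizes, not by the square of the block count."""
    postings = defaultdict(list)
    for block_id, grams in blocks.items():
        for g in grams:
            postings[g].append(block_id)
    pairs = set()
    for ids in postings.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

blocks = {"A": {1, 2}, "B": {2, 3}, "C": {9}}
print(sorted(candidate_pairs(blocks)))   # → [('A', 'B')] — C shares no gram
```

Only the surviving pairs go on to Jaccard and, if they clear the threshold band, tree edit distance.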
Tunable knobs
- min_block_size — kills boilerplate (the biggest noise source).
- max_block_size — caps TED work; blocks above this are dropped.
- max_df — rare-gram filter cutoff for candidate generation.
- similarity_threshold and tree_threshold — where to draw the near-duplicate line.
- optional_blocks.containment and optional_blocks.min_overlap — type-3 detection sensitivity.
- workers — parallelism level.
- incremental / lazy_ast — re-run reuse and memory budget.
Caches
- AST cache (<cache_dir>/parser-v5_<sha1>.cache) — serialized parse trees keyed by sha1(file) + parser_version. Re-runs with no source changes skip parsing entirely.
- Index store (<cache_dir>/<sha1(path)>.idx) — per-file block snapshots keyed by content hash + parser version + config key. Re-runs reuse blocks for unchanged files.
Both live under .phpdup-cache/ next to the project root and are
safe to delete at any time.
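As a sketch of how the documented AST-cache naming composes a key — illustrative Python; the real derivation lives in the Parsing module and `ast_cache_path` is an invented name:

```python
import hashlib
from pathlib import Path

def ast_cache_path(cache_dir: str, source: bytes,
                   parser_version: str = "v5") -> Path:
    """Mirror the documented scheme <cache_dir>/parser-<ver>_<sha1>.cache:
    keyed by the file's content hash, so any edit misses the cache and
    a parser upgrade changes every key at once."""
    digest = hashlib.sha1(source).hexdigest()
    return Path(cache_dir) / f"parser-{parser_version}_{digest}.cache"

p = ast_cache_path(".phpdup-cache", b"<?php echo 1;")
print(p.name.startswith("parser-v5_") and p.name.endswith(".cache"))  # → True
```

Content-hash keys are why deleting .phpdup-cache/ is always safe: the next run simply recomputes every entry.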
Roadmap
Active planning docs:
- ORM / DB-aware semantic deduplication — docs/plans/orm-db-semantic-dedup.md. Six-option matrix (DB-call canonical token rewrite, read→mutate→save trinity collapse, behavioural call-graph similarity, symbol equivalence-class registry, IR lift, ML-learned similarity) with a recommended phased rollout. Closes the gap where today an Eloquent User::find()->save() and an equivalent raw-SQL UPDATE don't cluster.
- LSP server — deferred until the persistent cluster cache earns its keep on real workloads (it just shipped). LSP needs incremental file-change re-scoring, which the cluster cache unblocks.
- Streaming clustering for >10M-block corpora — Phpdup\Index\ExternalSort is the on-disk K-way merge primitive; the wired-in clusterer path is the next layer of work.
- VSCode extension — depends on LSP.
Architectural decisions live under
docs/adr/ and algorithm references under
docs/algorithms/. Tutorials are in
docs/tutorials/, CI integration recipes in
docs/CI.md, and the IDE-plugin / playground
contracts in docs/JETBRAINS_PLUGIN.md.
FAQ
How is this different from PHPCPD?
PHPCPD finds duplicated tokens — long runs of identical lexer
output. phpdup works on the AST after canonicalization, so renamed
variables, different literals, different method names, and entire
optional code segments can all cluster together. More importantly,
phpdup tells you what the abstraction would look like — its
parameter list, types, and a suggested function name — not just
where the duplication is.
Will it rewrite my code?
No. phpdup is advisory only. It surfaces opportunities; humans
decide. Auto-rewriting was an explicit non-goal — see
ARCHITECTURE.md §1.
Does the parallel mode work on Windows / sandboxed PHP?
The worker pool detects pcntl_* availability at runtime and falls
back to a serial code path automatically. The CLI still accepts
--workers N so config files don't have to branch — the value is
ignored when pcntl is missing.
How much RAM does the cache use?
The AST cache stores serialized parser output keyed by file hash;
typical PHP file → ~5–50 KB on disk. The index store (incremental
snapshots) is ~10–100 KB per file. Both live under
.phpdup-cache/ next to your project root by default and can be
nuked at any time.
Why don't I get the threshold/role example as cleanly on my code?
Try --mode=aggressive --min-impact=30. The defaults are tuned for
quiet output on first run. Lowering --min-impact and switching to
aggressive exposes more candidates. Conversely, dropping to
default mode reduces false positives if you're getting noise.
Does it support PHP 7.x?
The tool itself requires PHP 8.1+ (uses xxh128 and constructor
property promotion). It can analyze codebases written in older PHP
versions — the parser handles 5.x and up.
Does it handle modern PHP 8.x syntax?
Yes — match expressions, enums, readonly, named arguments,
attributes, nullsafe, first-class callable syntax — all supported by
nikic/php-parser v5.
How do I re-render the demo GIFs?
Install VHS, then:
docs/tapes/render.sh                 # render all
docs/tapes/render.sh tui-live.tape   # render one
VHS needs a Chromium binary. On Ubuntu 23.10+ where unprivileged user
namespaces are restricted (and /usr/bin/chromium-browser is the
Snap-only stub), pass CHROME_BIN=/path/to/chrome — Playwright's
chrome-headless-shell works well — and the script bootstraps a
--no-sandbox shim on PATH.
The TUI looks empty / didn't appear when I added --tui.
The dashboard requires a real TTY for keyboard input and rendering.
If your terminal isn't a real TTY (CI, piped stdout, redirected
file) the SugarCraft Program will fail to attach to the input
stream — use --plain instead. The pipeline now runs inside the
runtime, so the panes update live as work progresses; if all the
panes are still at zero something likely failed during pipeline init
(run bin/phpdup analyze … --plain to see the error message that
the TUI might be hiding).
Contributing
PRs welcome. Please:
- Run composer test and add tests for new functionality.
- Match the existing PHP coding style (declare(strict_types=1), PSR-4, constructor property promotion, narrow public surfaces).
- For new normalization rules, similarity metrics, or pattern recognizers, document the rule in the relevant module's docblock.
For larger architectural changes, open an issue first to discuss the design.
License
MIT — see LICENSE.






