mfonte / sparsedigest
Fast PHP file hashing library using intelligent sampling. Reads only selected portions of large binary files instead of full content, dramatically reducing I/O overhead. Automatically falls back to full-file hashing for text files. Deterministic, non-cryptographic, optimized for change detection.
pkg:composer/mfonte/sparsedigest
Requires
- php: >=7.2
- ext-mbstring: *
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.4
- php-coveralls/php-coveralls: ^2.7
- phpbench/phpbench: ^0.17 || ^1.2
- phpstan/phpstan: ^1.12
- phpunit/phpunit: ^8
Suggests
- symfony/polyfill-mbstring: Install symfony/polyfill-mbstring if your environment lacks support for the mbstring extension
README
SparseDigest is a PHP library that provides fast file hashing by intelligently sampling chunks of data instead of hashing the entire content.
Tip
In a hurry? Go to Installation and Usage Examples to get started quickly!
SparseDigest is a sort-of Locality-Sensitive Hashing algorithm for file hashing, a subset of fuzzy hashing. Some concepts have also been taken from Rolling Hashes.
It's designed to be significantly faster than traditional full-file hashing, especially for large files, by using a technique called sampled hashing. This involves hashing only specific portions (samples) of the file, rather than the entire content.
For large files (typically binary or compressed), the sampling approach dramatically reduces I/O overhead. For text-based files, where even a minor modification outside the sampled regions may be critical, SparseDigest automatically falls back to a full-file hash.
The implementation is deterministic. On the same input, with the same sampling parameters, it will always produce the same hash.
Computational complexity is sub-linear relative to file size. While larger files require slightly more time (more sample offsets to compute), the time increase is much smaller than proportional to file size. For example, a 100MB file takes only ~4-5× longer than a 1MB file to hash, rather than 100× longer with full-file hashing.
For a detailed performance comparison of Sampled Hashing vs Native Hashing (Full Hash), refer to PERFORMANCE.md.
Performance Highlights
Based on real benchmarks (PHP 8.3, PHPBench 1.2.14):
| Scenario | SparseDigest | Native hash_file | Winner |
|---|---|---|---|
| 1MB binary file | ~37μs | ~111μs (xxh3) | ✅ SparseDigest 3× faster |
| 5MB binary file | ~174μs | ~806μs (xxh3) | ✅ SparseDigest 4.6× faster |
| 10MB binary file | ~399μs | ~1.8ms (xxh3) | ✅ SparseDigest 4.5× faster |
| 100MB binary file | ~4.8ms | ~18.2ms (xxh3) | ✅ SparseDigest 3.8× faster |
| Batch of large media (3×100MB) | ~12.4ms | ~47ms | ✅ SparseDigest 3.8× faster |
| Repeated hashing (5×) | ~28.3ms | ~110.7ms | ✅ SparseDigest 3.9× faster |
| Small files (<500KB) | ~16μs | ~13μs | ❌ Native 1.3× faster |
Key insights:
- Break-even point (~500KB-1MB): SparseDigest becomes faster than native `hash_file()` for files larger than this threshold.
- Constant memory usage (~1.85MB): memory stays constant regardless of file size, thanks to incremental hashing.
- Algorithm choice is irrelevant for sparse hashing: with only 10% of the data processed, all algorithms perform similarly.
- Massive speedup on large files: up to 5× faster than native `hash_file()` with xxh3 on typical file sizes.
Important
When to use native hash_file() instead:
- Files smaller than 500KB
- Text/document files (library falls back anyway)
- Cryptographic integrity requirements
TL;DR
composer require mfonte/sparsedigest
A minimum PHP 7.2 environment is required, with mbstring support preferred. If the extension is unavailable, symfony/polyfill-mbstring provides fallbacks for the mbstring functions.
Minimal usage example:
```php
use Mfonte\SparseDigest\SparseHasher;
use Mfonte\SparseDigest\FileHasher;

// Sparse hashing (fast, samples only portions of the file)
$sd = SparseHasher::new('/path/to/file');
$sparseHash = $sd->hash();
echo "Sparse Hash: " . $sparseHash->toHex() . "\n";

// Full-file hashing (reads entire file, cryptographically accurate)
$fd = FileHasher::new('/path/to/file');
$fullHash = $fd->hash();
echo "Full Hash: " . $fullHash->toHex() . "\n";
```
For a complete list of examples, refer to the Usage Examples section.
Key Features
- **Sparse Hashing (`SparseHasher`)**:
  - `hash()`: Computes a sampled hash using the fastest available algorithm. Accepts `coverage` (1-100%) and `sampleSize` parameters.
  - `hashAs($algo)`: Same as `hash()`, but uses the specified algorithm for the output representation.
  - For files detected as text, or too small to be sampled, automatically falls back to full-file hashing.
- **Full Hashing (`FileHasher`)**:
  - `hash()`: Computes a complete hash of the entire file using the fastest available algorithm.
  - `hashAs($algo)`: Same as `hash()`, but uses the specified algorithm (e.g., 'sha256', 'md5').
  - Use this when you need cryptographic accuracy or must detect any single-byte change.
- **File Abstraction (`File`)**:
  Low-level class for file operations: reading, hashing, MIME detection, and binary/text detection.
- **Performance**:
  Dramatically reduced I/O overhead for large binary files through intelligent sampling.
Visual Representation: Full-File Hash vs. Sampled Hash (240KB File)
File Size: 245760 bytes (can be divided into 60 blocks of 4096 bytes each)
Coverage: 10% (default) — determines how many blocks are actually sampled
Sample Size: 4096 bytes (default block size)
Calculation:
- Total sample bytes = 245760 × 10% = 24576 bytes
- Total blocks to sample = 24576 / 4096 = 6 blocks
- Distribution: 2 first blocks + 2 last blocks + 2 intermediate blocks
Legend:
- `██`: Data block used in the hash calculation (4096 bytes each)
- `░░`: Data block not used in the hash calculation
Full-File Hashing (60 blocks × 4096 bytes = 245760 bytes)

```
████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████
```

Total data read: 245760 bytes (all 60 blocks)

Sampled Hashing (6 blocks at 10% coverage)

```
████░░░░░░░░░░░░░░░░░░░░██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░██░░░░░░░░░░████
```

From left to right: first 2 blocks, intermediate block 1, intermediate block 2, last 2 blocks.

Total data read: 24576 bytes (~10% of file, 6 blocks)
Warning
Caveat about change detection:
Because sparse hashing only reads sampled regions, modifications in non-sampled areas will NOT be detected. This is by design: the algorithm trades detection granularity for speed. If your use case requires detecting every single-byte change anywhere in the file, use full-file hashing instead (`FileHasher::new($file)->hash()`).
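To make the trade-off concrete, here is a hypothetical illustration using the library's documented API (the path and the byte flip are invented for the example):

```php
use Mfonte\SparseDigest\SparseHasher;
use Mfonte\SparseDigest\FileHasher;

// Hypothetical illustration: a one-byte change in a non-sampled region
// can leave the sparse hash unchanged, while a full hash will differ.
$path = '/path/to/large.bin';

$sparseBefore = SparseHasher::new($path)->hash()->toHex();
$fullBefore   = FileHasher::new($path)->hash()->toHex();

// ... flip one byte somewhere in the middle of the file,
//     outside any sampled block ...

$sparseAfter = SparseHasher::new($path)->hash()->toHex();
$fullAfter   = FileHasher::new($path)->hash()->toHex();

// $sparseBefore === $sparseAfter may be true (change missed by sampling),
// while $fullBefore !== $fullAfter because every byte was read.
```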
The sampled hashing technique relies on fractions of a file, assuming that relevant changes are likely to be captured in the sampled regions. This approach is particularly effective in these scenarios:
- Compressed Archives: Where a minor change can result in a completely different file.
- Compressed Images (e.g., JPEG): Where a small change can affect the entire image (think of metadata edits, or a single-pixel change in a large image).
- Compressed Videos: Where a minor edit can alter the entire video.
- Large Binary Files: Where the cost of reading the entire file is prohibitive, and you can tolerate a small chance of false negatives.
Underlying Logic of Sampled Hashing in SparseDigest
SparseDigest's hashing strategy is built around the concept of sampled hashing. Instead of processing every byte of a file, the algorithm reads only selected portions, significantly reducing I/O overhead for large files. The following outlines the precise, step‐by‐step logic of the implementation:
1. File Sampling Determination
- **Minimum File Size Check**:
  The file is deemed too small to sample based on the `coverage` percentage and `sampleSize`. The algorithm calculates `totalSampleBytes = floor(fileSize × coverage / 100)`, then `totalBlocks = floor(totalSampleBytes / sampleSize)`. If `totalBlocks < 4`, the file is too small for sparse hashing and falls back to full-file hashing.
- **Binary File Detection**:
  Prior to sampling, the file is examined via the `isLikelyBinary()` method, which reads the first 512 bytes and applies a fast heuristic:
  - If any null byte (`\0`) is found, the file is immediately classified as binary (covers images, videos, archives, executables).
  - If control characters (ASCII < 9 or 14-31) are found, the file is classified as binary.
  - Otherwise, the file is considered text.

  If the file is detected as text, the library falls back to full-file hashing to ensure that even subtle modifications (which might occur outside the sampled regions) are caught. A sketch of both checks follows below.
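As a rough illustration of both checks in plain PHP (the helper names here are hypothetical, not the library's API):

```php
// Sketch of the two pre-sampling checks described above.
// isTooSmallToSample() and looksBinary() are hypothetical helper names.

function isTooSmallToSample(int $fileSize, int $coverage = 10, int $sampleSize = 4096): bool
{
    $totalSampleBytes = (int) floor($fileSize * $coverage / 100);
    $totalBlocks      = (int) floor($totalSampleBytes / $sampleSize);

    // Fewer than 4 blocks cannot even cover the 4 fixed edge blocks.
    return $totalBlocks < 4;
}

function looksBinary(string $path): bool
{
    $head = (string) file_get_contents($path, false, null, 0, 512);

    if (strpos($head, "\0") !== false) {
        return true; // null byte: images, videos, archives, executables
    }

    // Control characters below \t (ASCII < 9) or in the 14-31 range
    return preg_match('/[\x01-\x08\x0E-\x1F]/', $head) === 1;
}
```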
2. Sample Selection and Offset Calculation
- **Fixed Edge Blocks**:
  The algorithm always includes the first two blocks (starting at offset 0 and offset `sampleSize`) and the last two blocks (offsets at `fileSize − 2 × sampleSize` and `fileSize − sampleSize`). These edge blocks typically contain file headers and footers, which are critical in many binary formats.
- **Evenly Distributed Intermediate Samples**:
  The number of intermediate samples is calculated as `intermediateCount = totalBlocks − 4` (subtracting the 4 fixed edge blocks). These intermediate blocks are evenly distributed between the fixed edge blocks, ensuring that the sampled regions provide a representative snapshot of the file's interior content. A sketch of the offset selection follows below.
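A sketch of the offset selection under the rules above (the exact rounding and spacing in the library may differ):

```php
// Hypothetical sketch of the sample-offset selection described above.
function sampleOffsets(int $fileSize, int $coverage = 10, int $sampleSize = 4096): array
{
    $totalSampleBytes = (int) floor($fileSize * $coverage / 100);
    $totalBlocks      = (int) floor($totalSampleBytes / $sampleSize);

    // The 4 fixed edge blocks: first two and last two.
    $offsets = [
        0,
        $sampleSize,
        $fileSize - 2 * $sampleSize,
        $fileSize - $sampleSize,
    ];

    // Evenly spaced intermediate blocks between the edges.
    $intermediateCount = $totalBlocks - 4;
    $start = 2 * $sampleSize;
    $end   = $fileSize - 2 * $sampleSize;
    for ($i = 1; $i <= $intermediateCount; $i++) {
        $offsets[] = $start + (int) floor(($end - $start) * $i / ($intermediateCount + 1));
    }

    sort($offsets);
    return $offsets;
}

// The 240KB example from the diagram: 6 blocks in total.
print_r(sampleOffsets(245760)); // 2 leading + 2 intermediate + 2 trailing offsets
```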
3. Reading Data from Each Sample Block
- **Direct Block Reading**:
  Each sample block is read directly via `readBytesAtOffset()`, which performs a simple `fseek()` + `fread()` operation. This approach is optimized for speed:
  - No preprocessing or transformation is applied to the raw bytes.
  - The data is read exactly as stored on disk.
  - This minimizes I/O overhead and CPU processing time.
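In plain PHP, that read primitive is essentially the following (a sketch; `readBytesAtOffset()` itself is internal to the library):

```php
// The fseek() + fread() pattern described above, in its simplest form.
$fh = fopen('/path/to/file', 'rb');

$offset     = 4096; // e.g. the second sample block
$sampleSize = 4096;

fseek($fh, $offset);              // jump straight to the block
$block = fread($fh, $sampleSize); // raw bytes, exactly as stored on disk

fclose($fh);
```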
4. Incremental Hashing
- **Single-Pass Incremental Hashing**:
  Instead of computing intermediate hashes and concatenating them, SparseDigest uses PHP's incremental hashing API (`hash_init`, `hash_update`, `hash_final`):
  - A hash context is initialized with the selected algorithm.
  - Each sample block is fed directly to `hash_update()` as it's read.
  - The file size is appended as the final piece of data to enhance uniqueness.
  - `hash_final()` produces the final digest.

  This approach eliminates intermediate string allocations and reduces memory usage to a constant ~1.85MB regardless of file size.
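Putting the pieces together, a minimal sketch of this single-pass flow (reusing the hypothetical `sampleOffsets()` helper from above; `md5` is chosen only because it is available on every PHP build):

```php
// Minimal sketch of the incremental-hashing flow described above.
// Relies on the hypothetical sampleOffsets() helper sketched earlier.
function sparseHashSketch(string $path, string $algo = 'md5', int $sampleSize = 4096): string
{
    $fileSize = filesize($path);
    $ctx = hash_init($algo);               // context for the selected algorithm

    $fh = fopen($path, 'rb');
    foreach (sampleOffsets($fileSize, 10, $sampleSize) as $offset) {
        fseek($fh, $offset);
        hash_update($ctx, (string) fread($fh, $sampleSize)); // feed each block as read
    }
    fclose($fh);

    hash_update($ctx, (string) $fileSize); // file size appended for uniqueness
    return hash_final($ctx);               // hex digest
}
```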
5. Determinism and Performance
- **Determinism**:
  Given the same file and the same sampling parameters (`coverage` and `sampleSize`), the algorithm always selects the same sample offsets and produces the same hash output (see the example below).
- **Performance**:
  By reading only a fraction of the file (determined by `coverage`), the I/O overhead is drastically reduced. With the default 10% coverage, SparseDigest is 3-5× faster than native `hash_file()` for files larger than ~500KB. Memory usage is constant at ~1.85MB regardless of file size.
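For example, two runs over the same unchanged file with identical parameters compare equal, using the library's documented API:

```php
use Mfonte\SparseDigest\SparseHasher;

// Same file + same parameters => same offsets => same hash, run after run.
$first  = SparseHasher::new('/path/to/file')->hash(10, 4096)->toHex();
$second = SparseHasher::new('/path/to/file')->hash(10, 4096)->toHex();

var_dump($first === $second); // bool(true)
```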
6. Non-Cryptographic Nature
It is critical to note that while SparseDigest is optimized for speed and efficiency in detecting file changes, it is not designed for cryptographic security. The sampled hash is a heuristic representation and may not be suitable for applications requiring cryptographic integrity.
Summary
SparseDigest implements a sampled hashing strategy that:
- Selects a fixed set of edge samples (first 2 and last 2 blocks) and evenly distributes intermediate samples based on `coverage`.
- Reads each sample block directly without preprocessing, feeding data incrementally to the hash function.
- Uses incremental hashing (`hash_init` / `hash_update` / `hash_final`) for constant memory usage.
- Automatically falls back to full-file hashing for text-based files or files that are too small to sample.
- Guarantees determinism with minimal I/O overhead, making it ideal for large binary files where performance is a priority.
- Does not detect changes in non-sampled regions; this is an intentional trade-off for speed.
This highly opinionated approach prioritizes speed and practical consistency over cryptographic security, offering an efficient method to detect file changes while balancing I/O performance and data uniqueness.
Installation
Install via Composer:
composer require mfonte/sparsedigest
Minimum requirements:
- PHP 7.2 or higher
- mbstring support (preferred, but not mandatory)
- Nothing else!
Usage Examples
Sparse Hashing (Sampled Hash)
```php
use Mfonte\SparseDigest\SparseHasher;

$sd = SparseHasher::new('/path/to/file');

// Uses default sampling parameters (10% coverage, 4096 byte blocks)
// Returns a HashResult using the fastest available algorithm
$digest = $sd->hash();
echo "Sparse Hash: " . $digest->toHex() . "\n";

// With custom sampling parameters (20% coverage, 8192 byte blocks)
$digest = $sd->hash(20, 8192);
echo "Sparse Hash (20% coverage): " . $digest->toHex() . "\n";

// With a specific output algorithm (SHA256)
$digest = $sd->hashAs('sha256');
echo "Sparse Hash (SHA256): " . $digest->toHex() . "\n";

// Combining custom sampling and algorithm
$digest = $sd->hashAs('md5', 15, 2048);
echo "Sparse Hash (MD5, 15% coverage): " . $digest->toHex() . "\n";
```
NOTE: The hash output will differ when changing sampling parameters. This is expected behavior.
Heads up!
If you intend to persist sparse hashes (for example, in a database), remember to store the sampling parameters used. For re-hashing and comparison, you should use the same `coverage` and `sampleSize` values to obtain consistent results. A persistence sketch follows below.
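One way to do that (the record layout here is just an example, not something the library prescribes):

```php
use Mfonte\SparseDigest\SparseHasher;

// Example: persist the sampling parameters next to the hash so a later
// re-hash can reproduce the exact same sampling. The array layout is
// illustrative only.
$coverage   = 15;
$sampleSize = 8192;

$record = [
    'hash'       => SparseHasher::new('/path/to/file')->hash($coverage, $sampleSize)->toHex(),
    'coverage'   => $coverage,
    'sampleSize' => $sampleSize,
];

// Later: re-hash with the *stored* parameters before comparing.
$fresh   = SparseHasher::new('/path/to/file')->hash($record['coverage'], $record['sampleSize'])->toHex();
$changed = ($fresh !== $record['hash']);
```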
Full File Hashing
For cryptographic accuracy or when you need to detect any single-byte change, use FileHasher:
```php
use Mfonte\SparseDigest\FileHasher;

$fd = FileHasher::new('/path/to/file');

// Full hash with the fastest available algorithm
$digest = $fd->hash();
echo "Full Hash: " . $digest->toHex() . "\n";

// Full hash with a specific algorithm
$digest = $fd->hashAs('sha256');
echo "Full Hash (SHA256): " . $digest->toHex() . "\n";

// Full hash with MD5
$digest = $fd->hashAs('md5');
echo "Full Hash (MD5): " . $digest->toHex() . "\n";
```
Low-Level File Operations with File
The File class provides low-level access to file operations:
```php
use Mfonte\SparseDigest\File;

$file = File::new('/path/to/file');

// File information
echo "Size: " . $file->getSize() . " bytes\n";
echo "MIME: " . $file->getMimeType() . "\n";
echo "Binary: " . ($file->isLikelyBinary() ? 'yes' : 'no') . "\n";

// Full-file hash (alternative to FileHasher)
$digest = $file->hashWith('xxh3');
echo "Hash: " . $digest->toHex() . "\n";
```
Order of preference for hashing algorithms
The library uses the fastest available algorithm from the following list, in order of preference (fastest first):
```
'xxh3', 'xxh128', 'crc32c', 'xxh64', 'crc32', 'crc32b', 'murmur3f', 'xxh32',
'murmur3c', 'murmur3a', 'adler32', 'md4', 'fnv1a64', 'fnv132', 'fnv164', 'fnv1a32',
'joaat', 'md5', 'tiger128,3', 'tiger160,3', 'tiger192,3', 'sha1', 'tiger160,4',
'tiger128,4', 'tiger192,4', 'ripemd128', 'ripemd256', 'sha3-224', 'sha3-256',
'sha384', 'sha512/224', 'sha512/256', 'sha512', 'ripemd320', 'ripemd160', 'sha3-384',
'haval128,3', 'haval160,3', 'haval192,3', 'haval224,3', 'haval256,3', 'sha224',
'sha256', 'sha3-512', 'haval128,4', 'haval160,4', 'haval192,4', 'haval224,4',
'haval256,4', 'haval128,5', 'haval160,5', 'haval192,5', 'haval224,5', 'haval256,5',
'whirlpool', 'gost', 'gost-crypto', 'snefru', 'snefru256', 'md2'
```
This pre-defined list has been derived from https://php.watch/articles/php-hash-benchmark.
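Conceptually, "fastest available" selection boils down to walking this list and taking the first algorithm the current PHP build supports, along these lines (a sketch, not the library's internals; list truncated for brevity):

```php
// Sketch: pick the first algorithm from the preference list that this
// PHP build actually supports.
$preference = ['xxh3', 'xxh128', 'crc32c', 'xxh64', 'crc32', 'md5'];

$available = hash_algos(); // every algorithm supported by this build
foreach ($preference as $algo) {
    if (in_array($algo, $available, true)) {
        echo "Fastest available: {$algo}\n";
        break;
    }
}
```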
Testing and Benchmarking
Run the test suite with:
composer test
Run the full benchmark suite with:
composer bench
Run specific benchmark groups with:
composer bench:core
composer bench:scaling
composer bench:algorithms
composer bench:realworld
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Please see CONTRIBUTING for details.
License
The MIT License (MIT). Please see License File for more information.