scienide / helix
A library for counting short DNA sequences for use in Bioinformatics.
andrewdalpino
Requires
- php: >=7.4
- scienide/okbloomer: ^1.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.0
- phpbench/phpbench: ^1.0
- phpstan/phpstan: ^0.12.88
- phpunit/phpunit: ^9.5
Last update: 2022-02-22 08:26:32 UTC
README
A library for counting short DNA sequences for use in Bioinformatics. Helix consists of tools for data extraction as well as an ultra-low memory hash table called DNA Hash that is specialized for counting DNA sequences. DNA Hash stores sequence counts by their up2bit encoding, a two-way hash that exploits the fact that each DNA base needs only 2 bits to be fully encoded. Accordingly, DNA Hash uses less memory than a lookup table that stores raw gene sequences. In addition, DNA Hash's layered Bloom filter eliminates the need to explicitly store counts for sequences that have only been seen once.
- Ultra-low memory footprint
- Compatible with FASTA and FASTQ formats
- Supports canonical sequence counting
- Open-source and free to use commercially
Note: The maximum sequence length is platform dependent. On a 64-bit machine, the max length is 31. On a 32-bit machine, the max length is 15.
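The length limits follow from packing 2 bits per base into a single native integer. The sketch below illustrates the idea only and is not Helix's internal code: the names `BASE_CODES`, `encode2bit()`, and `decode2bit()` are hypothetical, and the leading marker bit is an assumption about how an up2bit-style encoding keeps leading `A` bases from being lost.

```php
<?php

declare(strict_types=1);

// Hypothetical 2-bit codes for each base (assumed mapping, for illustration).
const BASE_CODES = ['A' => 0, 'C' => 1, 'G' => 2, 'T' => 3];

/**
 * Pack a DNA sequence into one integer using 2 bits per base.
 * A leading 1 bit marks where the sequence starts.
 */
function encode2bit(string $sequence): int
{
    // One bit is reserved for the marker and one for the sign:
    // (64 - 2) / 2 = 31 bases on 64-bit, (32 - 2) / 2 = 15 on 32-bit.
    $maxLength = intdiv(PHP_INT_SIZE * 8 - 2, 2);

    if (strlen($sequence) > $maxLength) {
        throw new InvalidArgumentException("Sequence is longer than $maxLength bases.");
    }

    $hash = 1; // The marker bit.

    for ($i = 0, $n = strlen($sequence); $i < $n; ++$i) {
        $code = BASE_CODES[$sequence[$i]] ?? null;

        if ($code === null) {
            throw new InvalidArgumentException("Invalid base '{$sequence[$i]}'.");
        }

        $hash = ($hash << 2) | $code;
    }

    return $hash;
}

/**
 * Recover the original sequence from its 2-bit encoding.
 */
function decode2bit(int $hash): string
{
    $bases = 'ACGT';
    $sequence = '';

    // Stop once only the marker bit remains.
    while ($hash > 1) {
        $sequence = $bases[$hash & 3] . $sequence;
        $hash >>= 2;
    }

    return $sequence;
}

echo encode2bit('ACGT'), PHP_EOL;  // 283 (binary 1 00 01 10 11)
echo decode2bit(283), PHP_EOL;     // ACGT
```

Because the encoding is reversible, a table keyed on these integers does not need to keep the raw sequence strings.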
Note: Due to the probabilistic nature of the Bloom filter, DNA Hash may overcount sequences at a bounded rate.
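To see where the overcounting comes from, here is a minimal sketch of Bloom-filter-gated counting. It uses a plain single-layer filter as a stand-in for the layered filter that DNA Hash relies on, and the class names `TinyBloomFilter` and `SketchCounter` are hypothetical. A sequence gets an explicit count only on its second sighting, so a false positive in the filter inflates that sequence's count by one.

```php
<?php

declare(strict_types=1);

/**
 * Minimal single-layer Bloom filter (a stand-in, not the OkBloomer API).
 */
final class TinyBloomFilter
{
    /** @var bool[] */
    private array $bits;
    private int $size;
    private int $numHashes;

    public function __construct(int $size = 65536, int $numHashes = 3)
    {
        $this->size = $size;
        $this->numHashes = $numHashes;
        $this->bits = array_fill(0, $size, false);
    }

    /**
     * Insert the token and report whether it already appeared to be present.
     */
    public function existsOrInsert(string $token): bool
    {
        $exists = true;

        for ($i = 0; $i < $this->numHashes; ++$i) {
            // Salt the token with the hash index to get different offsets.
            $offset = (crc32("$i:$token") & 0x7FFFFFFF) % $this->size;

            if (!$this->bits[$offset]) {
                $exists = false;
                $this->bits[$offset] = true;
            }
        }

        return $exists;
    }
}

/**
 * Counter that stores explicit counts only for tokens seen at least twice.
 */
final class SketchCounter
{
    private TinyBloomFilter $filter;

    /** @var array<string,int> */
    private array $counts = [];

    public function __construct(TinyBloomFilter $filter)
    {
        $this->filter = $filter;
    }

    public function increment(string $token): void
    {
        if ($this->filter->existsOrInsert($token)) {
            // Second or later sighting, or a false positive (overcounts by one).
            $this->counts[$token] = ($this->counts[$token] ?? 1) + 1;
        }
        // A first sighting is remembered only by the filter.
    }

    /** @return array<string,int> */
    public function explicitCounts(): array
    {
        return $this->counts;
    }
}

$counter = new SketchCounter(new TinyBloomFilter());

foreach (['ACGT', 'ACGT', 'TTTT', 'ACGT'] as $token) {
    $counter->increment($token);
}

print_r($counter->explicitCounts());
```

In this run `ACGT` ends up with an explicit count of 3, while `TTTT` occupies no slot in the count table unless its filter bits happen to collide with those of `ACGT`.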
Installation
Install into your project using Composer:
$ composer require scienide/helix
Requirements
- PHP 7.4 or above
Example
use Helix\DNAHash;
use Helix\Extractors\FASTA;
use Helix\Tokenizers\Canonical;
use Helix\Tokenizers\Kmer;

// Read sequences from a FASTA file.
$extractor = new FASTA('example.fa');

// Split each sequence into canonical k-mers of length 25.
$tokenizer = new Canonical(new Kmer(25));

$hashTable = new DNAHash(0.001);

foreach ($extractor as $sequence) {
    $tokens = $tokenizer->tokenize($sequence);

    foreach ($tokens as $token) {
        $hashTable->increment($token);
    }
}

// Fetch the 10 highest-count sequences and print them.
$top10 = $hashTable->top(10);

print_r($top10);
Array
(
    [GCTATAAAAAGAAAATTTTGGAATA] => 19
    [ATTCCAAAATTTTCTTTTTATAGCC] => 19
    [TAAAAAGAAAATTTTGGAATAAAAA] => 18
    [ATAAAAAGAAAATTTTGGAATAAAA] => 18
    [TATAAAAAGAAAATTTTGGAATAAA] => 18
    [CTATAAAAAGAAAATTTTGGAATAA] => 18
    [AAATAATTTCAATTTTCTATCTCAA] => 17
    [AAAATAATTTCAATTTTCTATCTCA] => 17
    [CAAAATAATTTCAATTTTCTATCTC] => 17
    [AGATAGAAAATTGAAATTATTTTGA] => 17
)
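The example above wraps the k-mer tokenizer in a canonical tokenizer. As a rough illustration of what canonical counting means, the sketch below treats a k-mer and its reverse complement as the same sequence by always keeping one representative. The function names are hypothetical, and choosing the lexicographically smaller strand is an assumed convention that is common in k-mer counters, not something this README states about Helix.

```php
<?php

declare(strict_types=1);

/**
 * Reverse complement of a DNA sequence made of A, C, G and T.
 */
function reverseComplement(string $sequence): string
{
    return strrev(strtr($sequence, 'ACGT', 'TGCA'));
}

/**
 * Assumed convention: keep the lexicographically smaller of the two strands.
 */
function canonical(string $kmer): string
{
    $reverse = reverseComplement($kmer);

    return strcmp($kmer, $reverse) <= 0 ? $kmer : $reverse;
}

/**
 * Slide a window of length $k over the sequence.
 *
 * @return string[]
 */
function kmers(string $sequence, int $k): array
{
    $tokens = [];

    for ($i = 0, $last = strlen($sequence) - $k; $i <= $last; ++$i) {
        $tokens[] = substr($sequence, $i, $k);
    }

    return $tokens;
}

// 3-mers of GATTACA: GAT, ATT, TTA, TAC, ACA
print_r(array_map('canonical', kmers('GATTACA', 3)));
// Canonical forms: ATC, AAT, TAA, GTA, ACA
```

With this convention, reads taken from either strand of the same region contribute to the same count.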
Testing
To run the unit tests:
$ composer test
Static Analysis
To run static code analysis:
$ composer analyze
Benchmarks
To run the benchmarks:
$ composer benchmark
References
- [1] https://github.com/JohnLonginotto/ACGTrie/blob/master/docs/UP2BIT.md.
- [2] P. Melsted et al. (2011). Efficient counting of k-mers in DNA sequences using a bloom filter.
- [3] S. Deorowicz et al. (2015). KMC 2: fast and resource-frugal k-mer counting.