sanix-darker / cisv
High-performance CSV parser with SIMD optimizations (AVX-512/AVX2) for PHP - 10-50x faster than fgetcsv
Package info
Language: C
Type: php-ext
Ext name: ext-cisv
pkg:composer/sanix-darker/cisv
- dev-master
- v0.3.3
- v0.3.2
- v0.3
- v0.2.6
- v0.2.5
- v0.2.4
- v0.2.3
- v0.2.2
- v0.2.1
- v0.2.0
- v0.1.9
- v0.1.8
- v0.1.7
- v0.1.6
- v0.1.5
- v0.1.4
- v0.1.3
- v0.1.2
- v0.1.1
- v0.1.0
- v0.0.81
- v0.0.80
- v0.0.79
- v0.0.78
- v0.0.77
- v0.0.76
- v0.0.75
- v0.0.74
- v0.0.73
- v0.0.72
- v0.0.71
- v0.0.70
- v0.0.69
- v0.0.67
- v0.0.66
- v0.0.65
- 0.0.64
- 0.0.63
- v0.0.61
- dev-feat/cisv-phaseA-fixes
- dev-dependabot/npm_and_yarn/bindings/nodejs/node-addon-api-8.6.0
- dev-dependabot/npm_and_yarn/bindings/nodejs/types/node-25.3.2
- dev-dependabot/npm_and_yarn/bindings/nodejs/nyc-18.0.0
- dev-fix/multiline-csv-108
- dev-dependabot/github_actions/actions/download-artifact-7
- dev-dependabot/github_actions/actions/upload-artifact-6
- dev-dependabot/github_actions/actions/checkout-6
- dev-dependabot/github_actions/actions/setup-python-6
- dev-dependabot/github_actions/actions/setup-node-6
- dev-f/parallel-parse-fixes
- dev-f/hight-c-py-bindings
- dev-feat/robust-optimization-improvements
- dev-fix/security-issues
- dev-fix/quote-field-parsing
- dev-f/packagist-update
- dev-fix-php-bindings-composer
- dev-php-bindings-composer
This package is auto-updated.
Last update: 2026-03-02 23:29:05 UTC
README
High-performance CSV parser and writer with SIMD optimizations (AVX-512/AVX2). Available as a C library, CLI tool, and bindings for Node.js, Python, and PHP.
Blog post: How I accidentally created the fastest CSV parser ever made
FEATURES
- SIMD-accelerated parsing (AVX-512, AVX2, with scalar fallback)
- Zero-copy memory-mapped file processing
- Streaming API for large files
- Row-by-row iterator API (fgetcsv-style) with early exit support
- Configurable delimiters, quotes, escapes, and comments
- Field transforms (uppercase, lowercase, trim, hash, base64, etc.)
- Parallel chunk processing for multi-threaded parsing
- RFC 4180 compliant with relaxed mode option
- CSV writer with SIMD optimization
PERFORMANCE
Benchmarks on Intel Core i5-8365U:
| Test | Throughput |
|---|---|
| Simple CSV | 165-290 MB/s |
| Quoted fields | 175-189 MB/s |
| Large fields | 153-169 MB/s |
Benchmark reports are generated by scripts/benchmark_cli.sh in CI and committed per binding:
- BENCHMARKS_CLI.md
- BENCHMARKS_NODEJS.md
- BENCHMARKS_PYTHON.md
- BENCHMARKS_PHP.md
INSTALLATION
BUILD FROM SOURCE
git clone https://github.com/sanix-darker/cisv
cd cisv
make all
INSTALL CLI
sudo make install-cli
NODE.JS
cd bindings/nodejs
npm install
npm run build
PYTHON
cd bindings/python
pip install -e .
PHP
cd bindings/php
phpize
./configure
make
sudo make install
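Then enable the extension in php.ini; the shared object name is assumed here to match the ext-cisv package name above:

; php.ini
extension=cisv

Verify the extension loaded with php -m | grep cisv.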
CLI USAGE
cisv [OPTIONS] FILE
Options:
-d, --delimiter CHAR Field delimiter (default: ,)
-q, --quote CHAR Quote character (default: ")
-e, --escape CHAR Escape character
-m, --comment CHAR Comment line prefix
-t, --trim Trim whitespace from fields
-r, --relaxed Relaxed parsing mode
-c, --count Count rows only
-s, --select COLS Select columns by index
-o, --output FILE Output file
-b, --benchmark Benchmark mode
--skip-empty Skip empty lines
--skip-errors Skip lines with errors
--from-line N Start at line N
--to-line N Stop at line N
--head N First N rows
--tail N Last N rows
EXAMPLES
# Parse CSV
cisv data.csv

# Parse TSV
cisv -d $'\t' data.tsv

# Count rows
cisv -c data.csv

# Select columns 0, 2, 5
cisv -s 0,2,5 data.csv

# Benchmark
cisv -b data.csv

# Parse lines 100-500
cisv --from-line 100 --to-line 500 data.csv
CSV WRITER
cisv write [OPTIONS]
Options:
-g, --generate N Generate N test rows
-o, --output FILE Output file
-d, --delimiter CHAR Field delimiter
-Q, --always-quote Quote all fields
-r, --crlf Use CRLF line endings
-n, --null TEXT NULL representation
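For example, combining the options above to generate a million test rows with every field quoted (a plausible invocation derived from the flags listed, not a documented example):

# Generate 1M test rows, quote all fields
cisv write -g 1000000 -Q -o test.csv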
DOCKER BENCHMARKS
docker build -t cisv-benchmark .
docker run --rm --cpus="2.0" --memory="4g" cisv-benchmark
Per-component benchmark environments (useful when local tools are missing):
# CLI tools benchmark
docker build -t cisv-cli-bench -f cli/benchmarks/Dockerfile .
docker run --rm --platform linux/amd64 --cpus=2 --memory=4g cisv-cli-bench

# Node.js benchmark
docker build -t cisv-node-bench -f bindings/nodejs/benchmarks/Dockerfile .
docker run --rm --platform linux/amd64 --cpus=2 --memory=4g cisv-node-bench

# PHP benchmark
docker build -t cisv-php-bench -f bindings/php/benchmarks/Dockerfile .
docker run --rm --platform linux/amd64 --cpus=2 --memory=4g cisv-php-bench

# Python (nanobind) benchmark
docker build -t cisv-pynb-bench -f bindings/python-nanobind/benchmarks/Dockerfile .
docker run --rm --platform linux/amd64 --cpus=2 --memory=4g cisv-pynb-bench
C API
#include <cisv/parser.h>
#include <cisv/writer.h>
#include <cisv/transformer.h>

// Callbacks
void on_field(void *user, const char *data, size_t len);
void on_row(void *user);

// Parse file
cisv_config cfg;
cisv_config_init(&cfg);
cfg.field_cb = on_field;
cfg.row_cb = on_row;
cfg.delimiter = ',';

cisv_parser *p = cisv_parser_create_with_config(&cfg);
cisv_parser_parse_file(p, "data.csv");
cisv_parser_destroy(p);

// Count rows (fast mode)
size_t count = cisv_parser_count_rows("data.csv");

// Row-by-row iterator (fgetcsv-style, supports early exit)
cisv_iterator_t *it = cisv_iterator_open("data.csv", &cfg);
const char **fields;
const size_t *lengths;
size_t field_count;
while (cisv_iterator_next(it, &fields, &lengths, &field_count) == CISV_ITER_OK) {
    // Process row - fields valid until next call
    if (should_stop) break;  // Early exit supported
}
cisv_iterator_close(it);

// Parallel processing
cisv_mmap_file_t *file = cisv_mmap_open("data.csv");
int chunk_count;
cisv_chunk_t *chunks = cisv_split_chunks(file, 4, &chunk_count);
// Parse chunks in parallel threads
cisv_mmap_close(file);
NODE.JS API
const { cisvParser } = require('cisv');

// Parse file
const parser = new cisvParser({ delimiter: ',', quote: '"', trim: true });
const rows = parser.parseSync('data.csv');

// Parse with transforms
parser.transform(0, 'uppercase')
      .transform(1, 'trim')
      .transform(2, (value) => value.toUpperCase());
const transformedRows = parser.parseSync('data.csv');

// Name-based transforms
parser.setHeaderFields(['id', 'name', 'email']);
parser.transformByName('name', 'uppercase');

// Stream parsing
const streamParser = new cisvParser();
streamParser.write(chunk);
streamParser.end();
const results = streamParser.getRows();

// Row-by-row iteration (memory efficient, supports early exit)
const iterParser = new cisvParser({ delimiter: ',' });
iterParser.openIterator('large.csv');
let row;
while ((row = iterParser.fetchRow()) !== null) {
    console.log(row);
    if (row[0] === 'stop') break;  // Early exit
}
iterParser.closeIterator();

// Fast row counting
const total = cisvParser.countRows('large.csv');
PYTHON API
import cisv

# Parse file
rows = cisv.parse_file('data.csv', delimiter=',', trim=True)

# Parse string
rows = cisv.parse_string('a,b,c\n1,2,3')

# Count rows (fast)
count = cisv.count_rows('data.csv')

# Row-by-row iteration (memory efficient, supports early exit)
with cisv.CisvIterator('large.csv') as reader:
    for row in reader:
        print(row)
        if row[0] == 'stop':
            break  # Early exit - no wasted work

# Or use the convenience function
for row in cisv.open_iterator('data.csv'):
    process(row)
PHP API
<?php
$parser = new CisvParser([
    'delimiter' => ',',
    'quote' => '"',
    'trim' => true
]);

// Parse entire file
$rows = $parser->parseFile('data.csv');

// Row-by-row iteration (memory efficient, supports early exit)
$parser->openIterator('large.csv');
while (($row = $parser->fetchRow()) !== false) {
    print_r($row);
    if ($row[0] === 'stop') break;  // Early exit
}
$parser->closeIterator();

// Count rows (fast)
$count = CisvParser::countRows('data.csv');
TRANSFORMS
Available transforms for field processing:
| Transform | Description |
|---|---|
| uppercase | Convert to uppercase |
| lowercase | Convert to lowercase |
| trim | Remove leading/trailing whitespace |
| trim_left | Remove leading whitespace |
| trim_right | Remove trailing whitespace |
| to_int | Parse as integer |
| to_float | Parse as float |
| to_bool | Parse as boolean |
| hash_md5 | MD5 hash |
| hash_sha256 | SHA-256 hash |
| base64_encode | Base64 encode |
| base64_decode | Base64 decode |
| url_encode | URL encode |
| url_decode | URL decode |
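Transforms are applied per field during parsing. As a rough illustration of what the uppercase transform does, here is a hand-rolled equivalent built on the C callback API from the C API section above; it bypasses <cisv/transformer.h> entirely, so it sketches the behavior, not the transformer API itself:

#include <ctype.h>
#include <stdio.h>
#include <cisv/parser.h>

// Hand-rolled equivalent of the "uppercase" transform: uppercase each
// field as it is emitted. Illustration only - the built-in transforms
// live in <cisv/transformer.h>.
static void upper_field(void *user, const char *data, size_t len) {
    (void)user;
    for (size_t i = 0; i < len; i++)
        putchar(toupper((unsigned char)data[i]));
    putchar('\t');
}

static void end_row(void *user) {
    (void)user;
    putchar('\n');
}

int main(void) {
    cisv_config cfg;
    cisv_config_init(&cfg);
    cfg.field_cb = upper_field;
    cfg.row_cb = end_row;
    cisv_parser *p = cisv_parser_create_with_config(&cfg);
    cisv_parser_parse_file(p, "data.csv");
    cisv_parser_destroy(p);
    return 0;
}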
CONFIGURATION OPTIONS
| Option | Default | Description |
|---|---|---|
| delimiter | , | Field separator |
| quote | " | Quote character |
| escape | (RFC4180) | Escape character |
| comment | none | Comment line prefix |
| trim | false | Trim field whitespace |
| skip_empty_lines | false | Skip empty lines |
| relaxed | false | Relaxed parsing mode |
| max_row_size | 0 | Max row size (0 = unlimited) |
| from_line | 0 | Start line (1-based) |
| to_line | 0 | End line (0 = all) |
| skip_lines_with_error | false | Skip error lines |
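In the C API these options live on cisv_config. Only delimiter, field_cb, and row_cb appear in the C API section above; the remaining field names below are assumed to mirror the option names in this table, so check cisv/parser.h for the authoritative names:

#include <cisv/parser.h>

cisv_config cfg;
cisv_config_init(&cfg);
cfg.delimiter = ';';        // field separator (shown in the C API above)
// Field names below are assumed to match the table; verify in cisv/parser.h.
cfg.quote = '"';            // quote character
cfg.trim = 1;               // trim field whitespace
cfg.skip_empty_lines = 1;   // skip empty lines
cfg.from_line = 2;          // start at line 2 (1-based), e.g. to skip a header
cfg.max_row_size = 0;       // 0 = unlimited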
BUILDING WITH PGO
Profile-Guided Optimization for best performance:
# Generate profile
cd core/build
cmake -DCISV_PGO_GENERATE=ON ..
make
./cisv -b large_file.csv

# Build with profile
cmake -DCISV_PGO_USE=ON ..
make
ARCHITECTURE
- SIMD: AVX-512 (64-byte) or AVX2 (32-byte) vector processing
- Memory: Zero-copy mmap with kernel read-ahead hints
- Buffer: 1MB ring buffer sized for L3 cache
- Fallback: SWAR (SIMD Within A Register) for non-AVX systems (sketched below)
- Security: Integer overflow protection, bounds checking
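The SWAR fallback boils down to the classic zero-byte trick: broadcast the delimiter across a 64-bit word, XOR it with the input, and test each byte lane for zero, scanning 8 bytes per iteration with plain integer ops. A minimal sketch of the technique (not cisv's actual code):

#include <stdint.h>
#include <string.h>

// Generic SWAR delimiter scan: returns the index of the first occurrence
// of `c` in `buf` (which must hold at least 8 bytes), or -1 if absent.
// Not cisv's implementation - just the technique its fallback is built on.
static int swar_find_byte(const char *buf, char c) {
    uint64_t word;
    memcpy(&word, buf, 8);                               // load 8 bytes at once
    uint64_t pat = 0x0101010101010101ULL * (uint8_t)c;   // broadcast delimiter
    uint64_t x = word ^ pat;                             // matching bytes become 0x00
    // Classic "has zero byte" trick: sets the high bit of each zero byte.
    uint64_t hit = (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
    if (!hit) return -1;
    return __builtin_ctzll(hit) >> 3;                    // byte index (little-endian)
}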
REQUIREMENTS
- Linux/macOS (x86_64 or ARM64)
- GCC 7+ or Clang 6+
- CMake 3.14+
- Node.js 14+ (for Node.js binding)
- Python 3.7+ (for Python binding)
- PHP 7.4+ (for PHP extension)
LIMITATIONS
- Windows support planned for future release (maybe...)
LICENSE
MIT
AUTHOR
sanix-darker
ACKNOWLEDGMENTS
Inspired by: