jbzoo/csv-blueprint

CLI Utility for Validating and Generating CSV files based on custom rules. It ensures your data meets specified criteria, streamlining data management and integrity checks.

Installs: 2 482

Dependents: 1

Suggesters: 0

Security: 0

Stars: 13

Watchers: 0

Forks: 0

Type:project

1.0.0 2024-04-19 20:27 UTC

README

CI CI Psalm Coverage Coverage Bugs Code smells Docker Pulls

Static Badge Static Badge Static Badge Static Badge Static Badge

Strict and automated line-by-line CSV validation tool based on customizable Yaml schemas.

In seconds, make sure every char in a gigabyte file meets your expectations.

Intro

I believe it is the simplest yet flexible and powerful CSV validator in the world. ☺️

Features

  • Just create a simple and friendly Yaml with your CSV schema and the tool will validate your files line by line. You will get a very detailed report with row, column and rule accuracy.
  • Out of the box, you have access to over 330 validation rules that can be combined to control the severity of validation.
  • You can validate each value (like, date has a strict format on each line), or the entire column (like, median of all values is within limits). It's up to you to choose the severity of the rules.
  • Use it anywhere as it is packaged in Docker or even as part of your GitHub Actions.
  • Create a CSV in your pipelines/ETL/CI and ensure that it meets the most stringent expectations.
  • Prepare your own libraries with complex rules using presets. This will help you work with hundreds of different files at the same time.
  • Create schema on the fly based on an existing CSV file and also analyze data in CSV - find out what is stored in your file and get a summary report.
CLICK to see a typical cross-team workflow
  1. Data Preparation: Team A generates CSV data adhering to a predefined format and places the file in a shared location accessible to Team B (e.g., a shared repository or cloud storage).

  2. Notification: Team A notifies Team B that the data is ready via corporate communication channels (email, chat, task management system).

  3. Validation: Team B uses predefined validation rules stored in the repository to check the CSV file for accuracy and integrity before importing. This includes verifying date formats, numerical values, and the presence of required columns.

  4. Data Import: After successful validation, Team B imports the data from the CSV file into their system for further processing.

  5. Error Handling: If validation identifies errors, the process halts, and Team B provides feedback to Team A for data correction.

Why Validation is Necessary:

  • Reduce Errors: Validating data before import minimizes the likelihood of errors, enhancing data quality.
  • Efficiency: Prevents time loss on manual error correction post-import.
  • Data Consistency: Ensures data meets the expectations and requirements of Team B, facilitating accurate processing and analysis.
  • Automation: Storing validation rules in the repository eases the process of checking automation and simplifies updating validation criteria.

Live demo

As a live demonstration of how the tool works, you can explore the super minimal repository at demo. For more complex examples and various reporting methods, take a look at the demo pipeline with different reports types.

See also:

Table of content

Usage

Docker container

Ensure you have Docker installed on your machine.

# Pull the Docker image
docker pull jbzoo/csv-blueprint:latest

# Run the tool inside Docker
docker run --rm                                  \
    --workdir=/parent-host                       \
    -v $(pwd):/parent-host                       \
    jbzoo/csv-blueprint:latest                   \
    validate-csv                                 \ # See available commands and options below.
    --csv=./tests/fixtures/demo.csv              \ # Your CSV(s).
    --schema=./tests/schemas/demo_invalid.yml    \ # Your schema(s).
    --ansi

# OR build it from source.
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint
make docker-build  # local tag is "jbzoo/csv-blueprint:local"

GitHub Action

You can find launch examples in the workflow demo.

- uses: jbzoo/csv-blueprint@master # See the specific version on releases page. `@master` is latest.
  with:
    # Specify the path(s) to the CSV files you want to validate.
    #   This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
    #   Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
    csv: './tests/**/*.csv'

    # Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
    #   Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
    #   Examples: p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
    schema: './tests/**/*.yml'

    # Report format. Available options: text, table, github, gitlab, teamcity, junit.
    # Default value: 'table'
    report: 'table'

    # Apply all schemas (also without `filename_pattern`) to all CSV files found as global rules.
    #   Available options:
    #   auto: If no glob pattern (*) is used for --schema, the schema is applied to all found CSV files.
    #   yes: Apply all schemas to all CSV files, Schemas without `filename_pattern` are applied as a global rule.
    #   no: Apply only schemas with not empty `filename_pattern` and match the CSV files.
    # Default value: 'auto'
    apply-all: 'auto'

    # Quick mode. It will not validate all rows. It will stop after the first error.
    # Default value: 'no'
    quick: 'no'

    # Skip schema validation. If you are sure that the schema is correct, you can skip this check.
    # Default value: 'no'
    skip-schema: 'no'

    # Extra options for the CSV Blueprint. Only for debbuging and profiling.
    # Available options:
    #   Add flag `--parallel` if you want to validate CSV files in parallel.
    #   Add flag `--dump-schema` if you want to see the final schema after all includes and inheritance.
    #   Add flag `--debug` if you want to see more really deep details.
    #   Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
    #   Verbosity level: Available options: `-v`, `-vv`, `-vvv`
    #   ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
    # Default value: 'options: --ansi'
    # You can skip it.
    extra: 'options: --ansi'

Phar binary

CLICK to see using PHAR file

Ensure you have PHP installed on your machine.

# Just download the latest version
wget https://github.com/jbzoo/csv-blueprint/releases/latest/download/csv-blueprint.phar
chmod +x ./csv-blueprint.phar
./csv-blueprint.phar validate-csv               \
   --csv=./tests/fixtures/demo.csv              \
   --schema=./tests/schemas/demo_invalid.yml

# OR create project via Composer (--no-dev is optional)
composer create-project --no-dev jbzoo/csv-blueprint
cd ./csv-blueprint
./csv-blueprint validate-csv                    \
    --csv=./tests/fixtures/demo.csv             \
    --schema=./tests/schemas/demo_invalid.yml

# OR build from source
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint
make build
./csv-blueprint validate-csv                    \
    --csv=./tests/fixtures/demo.csv             \
    --schema=./tests/schemas/demo_invalid.yml

Schema definition

Define your CSV validation schema in YAML for clear and structured configuration. Alternative formats are also supported: JSON and PHP, accommodating various preferences and workflow requirements.

The provided example illustrates a schema for a CSV file with a header row. It mandates that the id column must not be empty and should only contain integer values. Additionally, the name column is required to have a minimum length of 3 characters, ensuring basic data integrity and usefulness.

Example schema in YAML

name: Simple CSV Schema
filename_pattern: /my-favorite-csv-\d+\.csv$/i
csv:
  delimiter: ';'

columns:
  - name: id
    rules:
      not_empty: true
      is_int: true
    aggregate_rules:
      is_unique: true
      sorted: [ asc, numeric ]

  - name: name
    rules:
      length_min: 3
    aggregate_rules:
      count: 10

Full schema description

In the example YAML file, a detailed description of all features is provided. This documentation is verified through automated tests, ensuring it remains current.

Notes:

  • The traditional typing of columns (e.g., type: integer) has been intentionally omitted in favor of rules. These rules can be sequenced and combined freely, offering extensive flexibility for CSV file validation.
  • All options are optional unless stated otherwise. You have the liberty to include or omit them as you see fit.
  • Specifying an incorrect rule name, using non-existent values (not listed below), or assigning an incompatible variable type for any option will result in a schema validation error. To bypass these errors, you may opt to use the --skip-schema flag at your discretion, allowing the use of your custom keys in the schema.
  • All rules ignore the empty string except not_empty. It doesn't ignore empty strings (length 0). To enforce a non-empty value, apply not_empty: true. Note that a single space counts as a character, making the string length 1. To prevent such scenarios, include is_trimmed: true.
  • Rules operate independently; they have no knowledge of or influence over one another.
  • When a rule's value is is_some_rule: true, it merely serves as an activation toggle. Other values represent rule parameters.
  • The sequence of rule execution follows their order in the schema, affecting only the order of error messages in the report.
  • Unless explicitly stated, most rules are case-sensitive.
  • As a fallback, the regex rule is available. However, using clear rule combinations is recommended for greater clarity on validation errors.

Below is a comprehensive list of rules, each accompanied by a brief explanation and example for clarity. This section is also validated through automated tests, ensuring the information is consistently accurate.

CLICK to see details about each rule
# It's a complete example of the CSV schema file in YAML format.
# See copy of the file without comments here ./schema-examples/full_clean.yml

# Just meta
name: CSV Blueprint Schema Example      # Name of a CSV file. Not used in the validation process.
description: |-                         # Any description of the CSV file. Not used in the validation process.
  This YAML file provides a detailed description and validation rules for CSV files
  to be processed by CSV Blueprint tool. It includes specifications for file name patterns,
  CSV formatting options, and extensive validation criteria for individual columns and their values,
  supporting a wide range of data validation rules from basic type checks to complex regex validations.
  This example serves as a comprehensive guide for creating robust CSV file validations.

presets:                                # Include another schema and define an alias for it.
  my-preset: ./preset_users.yml         # Define preset alias "my-preset". See README.md for details.

# Regular expression to match the file name. If not set, then no pattern check.
# This allows you to pre-validate the file name before processing its contents.
# Feel free to check parent directories as well.
# See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
filename_pattern: /\.csv$/i
#  preset: my-preset                    # See README.md for details.

# Here are default values to parse CSV file.
# You can skip this section if you don't need to override the default values.
csv:
  preset: my-preset                     # See README.md for details.
  header: true                          # If the first row is a header. If true, name of each column is required.
  delimiter: ,                          # Delimiter character in CSV file.
  quote_char: \                         # Quote character in CSV file.
  enclosure: '"'                        # Enclosure for each field in CSV file.
  encoding: utf-8                       # (Experimental) Only utf-8, utf-16, utf-32.
  bom: false                            # (Experimental) If the file has a BOM (Byte Order Mark) at the beginning.

# Structural rules for the CSV file. These rules are applied to the entire CSV file.
# They are not(!) related to the data in the columns.
# You can skip this section if you don't need to override the default values.
structural_rules: # Here are default values.
  preset: my-preset                     # See README.md for details.
  strict_column_order: true             # Ensure columns in CSV follow the same order as defined in this YML schema. It works only if "csv.header" is true.
  allow_extra_columns: false            # Allow CSV files to have more columns than specified in this YML schema.

# Description of each column in CSV.
# It is recommended to present each column in the same order as presented in the CSV file.
# This will not affect the validator, but will make it easier for you to navigate.
# For convenience, use the first line as a header (if possible).
columns:
  - preset: my-preset/login             # Add preset rules for the column. See README.md for details.
    name: Column Name (header)          # Any custom name of the column in the CSV file (first row). Required if "csv.header" is true.
    description: Lorem ipsum            # Description of the column. Not used in the validation process.
    example: Some example               # Example of the column value. Schema will also check this value on its own.

    # If the column is required. If true, the column must be present in the CSV file. If false, the column can be missing in the CSV file.
    # So, if you want to make the column optional, set this value to false, and it will validate the column only if it is present.
    # By default, the column is required. It works only if "csv.header" is true and "structural_rules.allow_extra_columns" is false.
    required: true

    ####################################################################################################################
    # Data validation for each(!) value in the column. Please, see notes in README.md
    # Every rule is optional.
    rules:
      preset: my-preset/login           # Add preset rules for the column. See README.md for details.

      # General rules
      not_empty: true                   # Value is not an empty string. Actually checks if the string length is not 0.
      exact_value: Some string          # Exact value for string in the column.
      allow_values: [ y, n, "" ]        # Strict set of values that are allowed.
      not_allow_values: [ invalid ]     # Strict set of values that are NOT allowed.

      # Any valid regex pattern. See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
      # Of course it's a super powerful tool to verify any sort of string data.
      # Please, be careful. Regex is a powerful tool, but it can be very dangerous if used incorrectly.
      # Remember that if you want to solve a problem with regex, you now have two problems.
      # But have it your way, then happy debugging! https://regex101.com
      regex: /^[\d]{2}$/

      # Checks length of a string including spaces (multibyte safe).
      length_min: 1                     # x >= 1
      length_greater: 2                 # x >  2
      length_not: 0                     # x != 0
      length: 7                         # x == 7
      length_less: 8                    # x <  8
      length_max: 9                     # x <= 9

      # Basic string checks
      is_trimmed: true                  # Only trimmed strings. Example: "Hello World" (not " Hello World ").
      is_lowercase: true                # String is only lower-case. Example: "hello world".
      is_uppercase: true                # String is only upper-case. Example: "HELLO WORLD".
      is_capitalize: true               # String is only capitalized. Example: "Hello World".
      is_sentence: true                 # Sentence with at least one space. Example: "Hello world!".

      # Count number of words used in a string
      # Note that multibyte locales are not supported.
      # Example: "Hello World, 123" - 2 words only (123 is not a word).
      word_count_min: 1                 # x >= 1
      word_count_greater: 2             # x >  2
      word_count_not: 0                 # x != 0
      word_count: 7                     # x == 7
      word_count_less: 8                # x <  8
      word_count_max: 9                 # x <= 9

      # Contains rules
      contains: World                   # Example: "Hello World!". The string must contain "World" in any place.
      contains_none: [ a, b ]           # All the strings must NOT be part of a CSV value.
      contains_one: [ a, b ]            # Only one of the strings must be part of the CSV value.
      contains_any: [ a, b ]            # At least one of the string must be part of the CSV value.
      contains_all: [ a, b ]            # All the strings must be part of a CSV value.
      starts_with: 'prefix '            # Example: "prefix Hello World".
      ends_with: ' suffix'              # Example: "Hello World suffix".

      # Numeric
      is_int: true                      # Check format only. Can be negative and positive. Without any separators.
      is_float: true                    # Check format only. Can be negative and positive. Dot as decimal separator.

      # Under the hood it converts and compares as float values.
      # Comparison accuracy is 10 digits after a dot.
      # Scientific number format is also supported. Example: "1.2e3"
      num_min: 1.0                      # x >= 1.0
      num_greater: 2.0                  # x >  2.0
      num_not: 5.0                      # x != 5.0
      num: 7.0                          # x == 7.0
      num_less: 8.0                     # x <  8.0
      num_max: 9.0                      # x <= 9.0

      # Number of digits after the decimal point (with zeros)
      precision_min: 1                  # x >= 1
      precision_greater: 2              # x >  2
      precision_not: 0                  # x != 0
      precision: 7                      # x == 7
      precision_less: 8                 # x <  8
      precision_max: 9                  # x <= 9

      # Date & time
      is_date: true                     # Accepts arbitrary date format. Is shows error if failed to convert to timestamp.
      is_timezone: true                 # Allow only timezone identifiers. Case-insensitive. Example: "Europe/London", "utc".
      is_timezone_offset: true          # Allow only timezone offsets. Example: "+03:00".
      is_time: true                     # Check if the cell value is a valid time in the format "HH:MM:SS AM/PM" / "HH:MM:SS" / "HH:MM". Case-insensitive.
      is_leap_year: true                # Check if the cell value is a leap year. Example: "2008", "2008-02-29 23:59:59 UTC".

      # Dates. Under the hood, the strings are converted to timestamp and compared.
      # This gives you the ability to use relative dates and any formatting you want.
      # By default, it works in UTC. But you can specify your own timezone as part of the date string.
      # Format:    https://www.php.net/manual/en/datetime.format.php
      # Parsing:   https://www.php.net/manual/en/function.strtotime.php
      # Timezones: https://www.php.net/manual/en/timezones.php
      date_min: -100 years              # Example of relative past date
      date_greater: -99 days            # Example of relative formats
      date_not: 2006-01-02 15:04:05 -0700 Europe/Rome
      date: 01 Jan 2000                 # You can use any string that can be parsed by the strtotime function
      date_less: now                    # Example of current date and time
      date_max: +1 day                  # Example of relative future date
      date_format: Y-m-d                # Check strict format of the date.

      # Date Intervals. Under the hood, the strings are converted to seconds and compared.
      # See: https://www.php.net/manual/en/class.dateinterval.php
      # See: https://www.php.net/manual/en/dateinterval.createfromdatestring.php
      date_interval_min: PT0S           # 0 seconds
      date_interval_greater: 1day 1sec  # 1 day and 1 second
      date_interval_not: 100 days       # Except for the 100 days
      date_interval: P2W                # Exactly 2 weeks
      date_interval_less: PT23H59M59S   # 23 hours, 59 minutes, and 59 seconds
      date_interval_max: P1Y            # 1 year

      # Check an arbitrary date in a CSV cell for age (years).
      # Actually it calculates the difference between the date and the current date.
      # Convenient to use for age restrictions based on birthday.
      # See the description of `date_*` functions for details on date formats.
      date_age_min: 1                   # x >= 1
      date_age_greater: 14              # x >  14
      date_age_not: 18                  # x != 18
      date_age: 21                      # x == 21
      date_age_less: 99                 # x <  99
      date_age_max: 100                 # x <= 100

      # Specific formats
      is_bool: true                     # Allow only boolean values "true" and "false", case-insensitive.
      is_binary: true                   # Both: with or without "0b" prefix. Example: "0b10" or "10".
      is_octal: true                    # Validates octal numbers in the format "0o123".
      is_hex: true                      # Both: with or without "0x" prefix. Example: "0x1A".
      is_uuid: true                     # Validates whether the input is a valid UUID. It also supports validation of specific versions 1, 3, 4 and 5.
      is_slug: true                     # Only slug format. Example: "my-slug-123". It can contain letters, numbers, and dashes.
      is_currency_code: true            # Validates an ISO 4217 currency code like GBP or EUR. Case-sensitive. See: https://en.wikipedia.org/wiki/ISO_4217.
      is_base64: true                   # Validate if a string is Base64-encoded. Example: "cmVzcGVjdCE=".
      is_angle: true                    # Check if the cell value is a valid angle (0.0 to 360.0).

      # Safity checks
      # Password strength calculation criteria include: Length (max 5 points, +1 every 2 characters),
      # presence of uppercase letters (+1), lowercase letters (+1), numbers (+1), special characters (+1),
      # spaces (+1), and penalties for consecutive sequences of uppercase, lowercase, or
      # numbers (-0.5 each), repetitive sequences (-0.75 each), common weak passwords like "qwerty",
      # and passwords under 6 characters (-2). Adjust scores to a 0 to 10 scale, with a minimum score of 0.
      password_strength_min: 1          # x >= 1
      password_strength_greater: 2      # x >  2
      password_strength_not: 0          # x != 0
      password_strength: 7              # x == 7
      password_strength_less: 8         # x <  8
      password_strength_max: 9          # x <= 9
      is_password_safe_chars: true      # Check that the cell value contains only safe characters for regular passwords. Allowed characters: a-z, A-Z, 0-9, !@#$%^&*()_+-=[]{};:'"|,.<>/?~.

      # Internet
      is_ip: true                       # Both: IPv4 or IPv6.
      is_ip_v4: true                    # Only IPv4. Example: "127.0.0.1".
      is_ip_v6: true                    # Only IPv6. Example: "2001:0db8:85a3:08d3:1319:8a2e:0370:7334".
      is_ip_private: true               # IPv4 has ranges: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. IPv6 has ranges starting with FD or FC.
      is_ip_reserved: true              # IPv4 has ranges: 0.0.0.0/8, 169.254.0.0/16, 127.0.0.0/8 and 240.0.0.0/4. IPv6 has ranges: ::1/128, ::/128, ::ffff:0:0/96 and fe80::/10.
      ip_v4_range: [ '127.0.0.1-127.0.0.5', '127.0.0.0/21' ] # Check subnet mask or range for IPv4. Address must be in one of the ranges.
      is_mac_address: true              # The input is a valid MAC address. Example: 00:00:5e:00:53:01
      is_domain: true                   # Only domain name. Example: "example.com".
      is_public_domain_suffix: true     # The input is a public ICANN domain suffix. Example: "com", "nom.br", "net" etc.
      is_url: true                      # Only URL format. Example: "https://example.com/page?query=string#anchor".
      is_email: true                    # Only email format. Example: "user@example.com".

      # Validates if the given input is a valid JSON.
      # This is possible if you escape all special characters correctly and use a special CSV format.
      is_json: true                     # Example: {"foo":"bar"}.

      # Geography
      is_latitude: true                 # Can be integer or float. Example: 50.123456.
      is_longitude: true                # Can be integer or float. Example: -89.123456.
      is_geohash: true                  # Check if the value is a valid geohash. Example: "u4pruydqqvj".
      is_cardinal_direction: true       # Valid cardinal direction. Case-insensitive. Available values: ["N", "S", "E", "W", "NE", "SE", "NW", "SW", "NONE"]
      is_usa_market_name: true          # Check if the value is a valid USA market name. Example: "New York, NY".
      is_usa_state: true                # Name or code of USA state name. Case-insensitive. Example: "CA" or "California".

      # Validates whether the input is a country code in ISO 3166-1 standard.
      # Available options: "alpha-2" (Ex: "US"), "alpha-3" (Ex: "USA"), "numeric" (Ex: "840").
      # The rule uses data from iso-codes: https://salsa.debian.org/iso-codes-team/iso-codes.
      country_code: alpha-2             # Country code in ISO 3166-1 standard. Examples: "US", "USA", "840"

      # Validates whether the input is language code based on ISO 639.
      # Available options: "alpha-2" (Ex: "en"), "alpha-3" (Ex: "eng").
      # See: https://en.wikipedia.org/wiki/ISO_639.
      language_code: alpha-2            # Examples: "en", "eng"

      # Filesystem (with IO!)
      is_file_exists: true              # Check if file exists on the filesystem (It's FS IO operation!).
      is_dir_exists: true               # Check if directory exists on the filesystem (It's FS IO operation!).

      # Mathematical
      is_fibonacci: true                # Validates whether the input follows the Fibonacci integer sequence. Example: "8", "13".
      is_prime_number: true             # Validates a prime number. Example: "3", "5", "7", "11".
      is_even: true                     # Check if the value is an even number. Example: "2", "4", "6".
      is_odd: true                      # Check if the value is an odd number. Example: "1", "7", "11".
      is_roman: true                    # Validates if the input is a Roman numeral. Example: "I", "IV", "XX".
      is_luhn: true                     # Luhn algorithm. See: https://en.wikipedia.org/wiki/Luhn_algorithm

      # Identifications
      phone: ALL                        # Validates if the input is a phone number. Specify the country code to validate the phone number for a specific country. Example: "ALL", "US", "BR".".
      postal_code: US                   # Validate postal code by country code (alpha-2). Example: "02179". Extracted from https://www.geonames.org
      is_iban: true                     # IBAN - International Bank Account Number. See: https://en.wikipedia.org/wiki/International_Bank_Account_Number
      is_bic: true                      # Validates a Bank Identifier Code (BIC) according to ISO 9362 standards. See: https://en.wikipedia.org/wiki/ISO_9362
      is_imei: true                     # Validates an International Mobile Equipment Identity (IMEI). See: https://en.wikipedia.org/wiki/International_Mobile_Station_Equipment_Identity
      is_isbn: true                     # Validates an International Standard Book Number (ISBN). See: https://www.isbn-international.org/content/what-isbn

      # Misc
      is_version: true                  # Validates the string as version numbers using Semantic Versioning. Example: "1.2.3".
      is_punct: true                    # Validates whether the input composed by only punctuation characters. Example: "!@#$%^&*()".
      is_vowel: true                    # Validates whether the input contains only vowels. Example: "aei".
      is_consonant: true                # Validates if the input contains only consonants. Example: "bcd".
      is_alnum: true                    # Validates whether the input is only alphanumeric. Example: "aBc123".
      is_alpha: true                    # This is similar to `is_alnum`, but it does not allow numbers. Example: "aBc".
      is_hex_rgb_color: true            # Validates weather the input is a hex RGB color or not. Examples: "#FF0000", "#123", "ffffff", "fff".

      # Check if the value is a valid hash. Supported algorithms:
      #  - md5, md4, md2, sha1, sha224, sha256, sha384, sha512/224, sha512/256, sha512
      #  - sha3-224, sha3-256, sha3-384, sha3-512, ripemd128, ripemd160, ripemd256, ripemd320, whirlpool, tiger128,3
      #  - tiger160,3, tiger192,3, tiger128,4, tiger160,4, tiger192,4, snefru, snefru256, gost, gost-crypto, crc32
      #  - crc32b, crc32c, adler32, fnv132, fnv1a32, fnv164, fnv1a64, joaat, murmur3a, murmur3c
      #  - murmur3f, xxh32, xxh64, xxh3, xxh128, haval128,3, haval160,3, haval192,3, haval224,3, haval256,3
      #  - haval128,4, haval160,4, haval192,4, haval224,4, haval256,4, haval128,5, haval160,5, haval192,5, haval224,5, haval256,5
      hash: set_algo                    # Example: "1234567890abcdef".

      # Check if a string is in a specific charset. Available charsets:
      #  - 7bit, 8bit, ASCII, ArmSCII-8, BASE64, BIG-5, CP850, CP866, CP932, CP936
      #  - CP950, CP50220, CP50221, CP50222, CP51932, EUC-CN, EUC-JP, EUC-JP-2004, EUC-KR, EUC-TW
      #  - GB18030, HTML-ENTITIES, HZ, ISO-2022-JP, ISO-2022-JP-2004, ISO-2022-JP-MOBILE#KDDI, ISO-2022-JP-MS, ISO-2022-KR, ISO-8859-1, ISO-8859-2
      #  - ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, ISO-8859-10, ISO-8859-13, ISO-8859-14
      #  - ISO-8859-15, ISO-8859-16, JIS, KOI8-R, KOI8-U, Quoted-Printable, SJIS, SJIS-2004, SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI
      #  - SJIS-Mobile#SOFTBANK, SJIS-mac, SJIS-win, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UHC
      #  - UTF-7, UTF-8, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A, UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, UTF-16, UTF-16BE, UTF-16LE, UTF-32
      #  - UTF-32BE, UTF-32LE, UTF7-IMAP, UUENCODE, Windows-1251, Windows-1252, Windows-1254, eucJP-win
      charset: charset_code             # Validates if a string is in a specific charset. Example: "UTF-8".

      # Validates whether the input is a credit card number.
      # Available credit card brands: "Any", "American Express", "Diners Club", "Discover", "JCB", "MasterCard", "Visa", "RuPay".
      credit_card: Any                  # Example: "5376-7473-9720-8720"

    ####################################################################################################################
    # Data validation for the entire(!) column using different data aggregation methods.
    # Every rule is optional.
    aggregate_rules:
      preset: my-preset/login           # Add preset aggregate rules for the column. See README.md for details.

      is_unique: true                   # All values in the column are unique.

      # Check if the column is sorted in a specific order.
      #  - Direction: ["asc", "desc"].
      #  - Method:    ["numeric", "string", "natural", "regular"].
      # See: https://www.php.net/manual/en/function.sort.php
      sorted: [ asc, natural ]          # Expected ascending order, natural sorting.

      # First number in the column. Expected value is float or integer.
      first_num_min: 1.0                # x >= 1.0
      first_num_greater: 2.0            # x >  2.0
      first_num_not: 5.0                # x != 5.0
      first_num: 7.0                    # x == 7.0
      first_num_less: 8.0               # x <  8.0
      first_num_max: 9.0                # x <= 9.0
      first: Expected                   # First value in the column. Will be compared as strings.
      first_not: Not expected           # Not allowed as the first value in the column. Will be compared as strings.

      # N-th value in the column.
      # The rule expects exactly two arguments: the first is the line number (without header), the second is the expected value.
      # Example: `[ 42, 5.0 ]` On the line 42 (disregarding the header), we expect the 5.0. The comparison is always as float.
      nth_num_min: [ 42, 1.0 ]          # x >= 1.0
      nth_num_greater: [ 42, 2.0 ]      # x >  2.0
      nth_num_not: [ 42, 5.0 ]          # x != 5.0
      nth_num: [ 42, 7.0 ]              # x == 7.0
      nth_num_less: [ 42, 8.0 ]         # x <  8.0
      nth_num_max: [ 42, 9.0 ]          # x <= 9.0
      nth: [ 2, Expected ]              # Nth value in the column. Will be compared as strings.
      nth_not: [ 2, Not expected ]      # Not allowed as the N-th value in the column. Will be compared as strings.

      # Last number in the column. Expected value is float or integer.
      last_num_min: 1.0                 # x >= 1.0
      last_num_greater: 2.0             # x >  2.0
      last_num_not: 5.0                 # x != 5.0
      last_num: 7.0                     # x == 7.0
      last_num_less: 8.0                # x <  8.0
      last_num_max: 9.0                 # x <= 9.0
      last: Expected                    # Last value in the column. Will be compared as strings.
      last_not: Not expected            # Not allowed as the last value in the column. Will be compared as strings.

      # Sum of the numbers in the column. Example: [1, 2, 3] => 6.
      sum_min: 1.0                      # x >= 1.0
      sum_greater: 2.0                  # x >  2.0
      sum_not: 5.0                      # x != 5.0
      sum: 7.0                          # x == 7.0
      sum_less: 8.0                     # x <  8.0
      sum_max: 9.0                      # x <= 9.0

      # Regular the arithmetic mean. The sum of the numbers divided by the count.
      average_min: 1.0                  # x >= 1.0
      average_greater: 2.0              # x >  2.0
      average_not: 5.0                  # x != 5.0
      average: 7.0                      # x == 7.0
      average_less: 8.0                 # x <  8.0
      average_max: 9.0                  # x <= 9.0

      # Total number of rows in the CSV file.
      # Since any(!) values are taken into account, it only makes sense to use these rules once in any column.
      count_min: 1                      # x >= 1
      count_greater: 2                  # x >  2
      count_not: 0                      # x != 0
      count: 7                          # x == 7
      count_less: 8                     # x <  8
      count_max: 9                      # x <= 9

      # Counts only empty values (string length is 0).
      count_empty_min: 1                # x >= 1
      count_empty_greater: 2            # x >  2
      count_empty_not: 0                # x != 0
      count_empty: 7                    # x == 7
      count_empty_less: 8               # x <  8
      count_empty_max: 9                # x <= 9

      # Counts only not empty values (string length is not 0).
      count_not_empty_min: 1            # x >= 1
      count_not_empty_greater: 2        # x >  2
      count_not_empty_not: 0            # x != 0
      count_not_empty: 7                # x == 7
      count_not_empty_less: 8           # x <  8
      count_not_empty_max: 9            # x <= 9

      # Number of unique values.
      count_distinct_min: 1             # x >= 1
      count_distinct_greater: 2         # x >  2
      count_distinct_not: 0             # x != 0
      count_distinct: 7                 # x == 7
      count_distinct_less: 8            # x <  8
      count_distinct_max: 9             # x <= 9

      # Number of positive values.
      count_positive_min: 1             # x >= 1
      count_positive_greater: 2         # x >  2
      count_positive_not: 0             # x != 0
      count_positive: 7                 # x == 7
      count_positive_less: 8            # x <  8
      count_positive_max: 9             # x <= 9

      # Number of negative values.
      count_negative_min: 1             # x >= 1
      count_negative_greater: 2         # x >  2
      count_negative_not: 0             # x != 0
      count_negative: 7                 # x == 7
      count_negative_less: 8            # x <  8
      count_negative_max: 9             # x <= 9

      # Number of zero values. Any text and spaces (i.e. anything that doesn't look like a number) will be converted to 0.
      count_zero_min: 1                 # x >= 1
      count_zero_greater: 2             # x >  2
      count_zero_not: 0                 # x != 0
      count_zero: 7                     # x == 7
      count_zero_less: 8                # x <  8
      count_zero_max: 9                 # x <= 9

      # Number of even values.
      count_even_min: 1                 # x >= 1
      count_even_greater: 2             # x >  2
      count_even_not: 0                 # x != 0
      count_even: 7                     # x == 7
      count_even_less: 8                # x <  8
      count_even_max: 9                 # x <= 9

      # Number of odd values.
      count_odd_min: 1                  # x >= 1
      count_odd_greater: 2              # x >  2
      count_odd_not: 0                  # x != 0
      count_odd: 7                      # x == 7
      count_odd_less: 8                 # x <  8
      count_odd_max: 9                  # x <= 9

      # Number of prime values.
      count_prime_min: 1                # x >= 1
      count_prime_greater: 2            # x >  2
      count_prime_not: 0                # x != 0
      count_prime: 7                    # x == 7
      count_prime_less: 8               # x <  8
      count_prime_max: 9                # x <= 9

      # Calculate the median average of a list of numbers.
      # See: https://en.wikipedia.org/wiki/Median
      median_min: 1.0                   # x >= 1.0
      median_greater: 2.0               # x >  2.0
      median_not: 5.0                   # x != 5.0
      median: 7.0                       # x == 7.0
      median_less: 8.0                  # x <  8.0
      median_max: 9.0                   # x <= 9.0

      # Harmonic mean (subcontrary mean). The harmonic mean can be expressed as the reciprocal of the arithmetic mean of the reciprocals.
      # Appropriate for situations when the average of rates is desired.
      # See: https://en.wikipedia.org/wiki/Harmonic_mean
      harmonic_mean_min: 1.0            # x >= 1.0
      harmonic_mean_greater: 2.0        # x >  2.0
      harmonic_mean_not: 5.0            # x != 5.0
      harmonic_mean: 7.0                # x == 7.0
      harmonic_mean_less: 8.0           # x <  8.0
      harmonic_mean_max: 9.0            # x <= 9.0

      # Geometric mean. A type of mean which indicates the central tendency or typical value of a set of numbers
      # by using the product of their values (as opposed to the arithmetic mean which uses their sum).
      # See: https://en.wikipedia.org/wiki/Geometric_mean
      geometric_mean_min: 1.0           # x >= 1.0
      geometric_mean_greater: 2.0       # x >  2.0
      geometric_mean_not: 5.0           # x != 5.0
      geometric_mean: 7.0               # x == 7.0
      geometric_mean_less: 8.0          # x <  8.0
      geometric_mean_max: 9.0           # x <= 9.0

      # Contraharmonic mean. A function complementary to the harmonic mean. A special case of the Lehmer mean, L₂(x), where p = 2.
      # See: https://en.wikipedia.org/wiki/Contraharmonic_mean
      contraharmonic_mean_min: 1.0      # x >= 1.0
      contraharmonic_mean_greater: 2.0  # x >  2.0
      contraharmonic_mean_not: 5.0      # x != 5.0
      contraharmonic_mean: 7.0          # x == 7.0
      contraharmonic_mean_less: 8.0     # x <  8.0
      contraharmonic_mean_max: 9.0      # x <= 9.0

      # Root mean square (quadratic mean) The square root of the arithmetic mean of the squares of a set of numbers.
      # See: https://en.wikipedia.org/wiki/Root_mean_square
      root_mean_square_min: 1.0         # x >= 1.0
      root_mean_square_greater: 2.0     # x >  2.0
      root_mean_square_not: 5.0         # x != 5.0
      root_mean_square: 7.0             # x == 7.0
      root_mean_square_less: 8.0        # x <  8.0
      root_mean_square_max: 9.0         # x <= 9.0

      # Trimean (TM, or Tukey's trimean).
      # A measure of a probability distribution's location defined as a weighted average of the distribution's median and its two quartiles.
      # See: https://en.wikipedia.org/wiki/Trimean
      trimean_min: 1.0                  # x >= 1.0
      trimean_greater: 2.0              # x >  2.0
      trimean_not: 5.0                  # x != 5.0
      trimean: 7.0                      # x == 7.0
      trimean_less: 8.0                 # x <  8.0
      trimean_max: 9.0                  # x <= 9.0

      # Cubic mean. See: https://en.wikipedia.org/wiki/Cubic_mean
      cubic_mean_min: 1.0               # x >= 1.0
      cubic_mean_greater: 2.0           # x >  2.0
      cubic_mean_not: 5.0               # x != 5.0
      cubic_mean: 7.0                   # x == 7.0
      cubic_mean_less: 8.0              # x <  8.0
      cubic_mean_max: 9.0               # x <= 9.0

      # Compute the P-th percentile of a list of numbers.
      # Linear interpolation between closest ranks method - Second variant, C = 1 P-th percentile (0 <= P <= 100) of a list of N ordered values (sorted from least to greatest).
      # Similar method used in NumPy and Excel.
      # See: https://en.wikipedia.org/wiki/Percentile#Second_variant.2C_.7F.27.22.60UNIQ--postMath-00000043-QINU.60.22.27.7F
      # Example: `[ 95.5, 1.234 ]` The 95.5th percentile in the column must be "1.234" (float).
      percentile_min: [ 95.0, 1.0 ]     # x >= 1.0
      percentile_greater: [ 95.0, 2.0 ] # x >  2.0
      percentile_not: [ 95.0, 5.0 ]     # x != 5.0
      percentile: [ 95.0, 7.0 ]         # x == 7.0
      percentile_less: [ 95.0, 8.0 ]    # x <  8.0
      percentile_max: [ 95.0, 9.0 ]     # x <= 9.0

      # Quartiles. Three points that divide the data set into four equal groups, each group comprising a quarter of the data.
      # See: https://en.wikipedia.org/wiki/Quartile
      # There are multiple methods for computing quartiles: ["exclusive", "inclusive"]. Exclusive is ussually classic.
      # Available types: ["0%", "Q1", "Q2", "Q3", "100%", "IQR"] ("IQR" is Interquartile Range)
      # Example: `[ inclusive, 'Q3', 42.0 ]` - the Q3 inclusive quartile is 42.0
      quartiles_min: [ exclusive, '0%', 1.0 ]               # x >= 1.0
      quartiles_greater: [ inclusive, 'Q1', 2.0 ]           # x >  2.0
      quartiles_not: [ exclusive, 'Q2', 5.0 ]               # x != 5.0
      quartiles: [ inclusive, 'Q3', 7.0 ]                   # x == 7.0
      quartiles_less: [ exclusive, '100%', 8.0 ]            # x <  8.0
      quartiles_max: [ inclusive, 'IQR', 9.0 ]              # x <= 9.0

      # Midhinge. The average of the first and third quartiles and is thus a measure of location.
      # Equivalently, it is the 25% trimmed mid-range or 25% midsummary; it is an L-estimator.
      # See: https://en.wikipedia.org/wiki/Midhinge
      # Midhinge = (first quartile, third quartile) / 2
      midhinge_min: 1.0                 # x >= 1.0
      midhinge_greater: 2.0             # x >  2.0
      midhinge_not: 5.0                 # x != 5.0
      midhinge: 7.0                     # x == 7.0
      midhinge_less: 8.0                # x <  8.0
      midhinge_max: 9.0                 # x <= 9.0

      # MAD - mean absolute deviation. The average of the absolute deviations from a central point.
      # It is a summary statistic of statistical dispersion or variability.
      # See: https://en.wikipedia.org/wiki/Average_absolute_deviation
      mean_abs_dev_min: 1.0             # x >= 1.0
      mean_abs_dev_greater: 2.0         # x >  2.0
      mean_abs_dev_not: 5.0             # x != 5.0
      mean_abs_dev: 7.0                 # x == 7.0
      mean_abs_dev_less: 8.0            # x <  8.0
      mean_abs_dev_max: 9.0             # x <= 9.0

      # MAD - median absolute deviation. The average of the absolute deviations from a central point.
      # It is a summary statistic of statistical dispersion or variability.
      # It is a robust measure of the variability of a univariate sample of quantitative data.
      # See: https://en.wikipedia.org/wiki/Median_absolute_deviation
      median_abs_dev_min: 1.0           # x >= 1.0
      median_abs_dev_greater: 2.0       # x >  2.0
      median_abs_dev_not: 5.0           # x != 5.0
      median_abs_dev: 7.0               # x == 7.0
      median_abs_dev_less: 8.0          # x <  8.0
      median_abs_dev_max: 9.0           # x <= 9.0

      # Population variance - Use when all possible observations of the system are present.
      # If used with a subset of data (sample variance), it will be a biased variance.
      # n degrees of freedom, where n is the number of observations.
      population_variance_min: 1.0      # x >= 1.0
      population_variance_greater: 2.0  # x >  2.0
      population_variance_not: 5.0      # x != 5.0
      population_variance: 7.0          # x == 7.0
      population_variance_less: 8.0     # x <  8.0
      population_variance_max: 9.0      # x <= 9.0

      # Unbiased sample variance Use when only a subset of all possible observations of the system are present.
      # n - 1 degrees of freedom, where n is the number of observations.
      sample_variance_min: 1.0          # x >= 1.0
      sample_variance_greater: 2.0      # x >  2.0
      sample_variance_not: 5.0          # x != 5.0
      sample_variance: 7.0              # x == 7.0
      sample_variance_less: 8.0         # x <  8.0
      sample_variance_max: 9.0          # x <= 9.0

      # Standard deviation (For a sample; uses sample variance). It also known as SD or StdDev.
      # StdDev is a measure that is used to quantify the amount of variation or dispersion of a set of data values.
      #  - Low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set.
      #  - High standard deviation indicates that the data points are spread out over a wider range of values.
      # See: https://en.wikipedia.org/wiki/Standard_deviation
      stddev_min: 1.0                   # x >= 1.0
      stddev_greater: 2.0               # x >  2.0
      stddev_not: 5.0                   # x != 5.0
      stddev: 7.0                       # x == 7.0
      stddev_less: 8.0                  # x <  8.0
      stddev_max: 9.0                   # x <= 9.0

      # SD+ (Standard deviation for a population; uses population variance)
      stddev_pop_min: 1.0               # x >= 1.0
      stddev_pop_greater: 2.0           # x >  2.0
      stddev_pop_not: 5.0               # x != 5.0
      stddev_pop: 7.0                   # x == 7.0
      stddev_pop_less: 8.0              # x <  8.0
      stddev_pop_max: 9.0               # x <= 9.0

      # Coefficient of variation (cᵥ) Also known as relative standard deviation (RSD)
      # A standardized measure of dispersion of a probability distribution or frequency distribution.
      # It is often expressed as a percentage. The ratio of the standard deviation to the mean.
      # See: https://en.wikipedia.org/wiki/Coefficient_of_variation
      coef_of_var_min: 1.0              # x >= 1.0
      coef_of_var_greater: 2.0          # x >  2.0
      coef_of_var_not: 5.0              # x != 5.0
      coef_of_var: 7.0                  # x == 7.0
      coef_of_var_less: 8.0             # x <  8.0
      coef_of_var_max: 9.0              # x <= 9.0

      # Interquartile mean (IQM). A measure of central tendency based on the truncated mean of the interquartile range.
      # Only the data in the second and third quartiles is used (as in the interquartile range), and the lowest 25% and the highest 25% of the scores are discarded.
      # See: https://en.wikipedia.org/wiki/Interquartile_mean
      # Note: It's SUPER slow!!!
      interquartile_mean_min: 1.0       # x >= 1.0
      interquartile_mean_greater: 2.0   # x >  2.0
      interquartile_mean_not: 5.0       # x != 5.0
      interquartile_mean: 7.0           # x == 7.0
      interquartile_mean_less: 8.0      # x <  8.0
      interquartile_mean_max: 9.0       # x <= 9.0

  - name: another_column
    rules:
      not_empty: true

  - name: inherited_column_login
    preset: my-preset/login

  - name: inherited_column_full_name
    preset: my-preset/full_name

Extra checks

In addition to the schema outlined in the YAML configuration, the tool performs several underlying checks by default to ensure thorough validation of your CSV files.

  • The filename_pattern rule verifies that the file name adheres to the specified regex pattern, ensuring file naming conventions are followed.
  • Ensures that the name property is defined for each column, applicable only when csv.header is set to true, to guarantee header integrity.
  • The required property, when set to true, mandates the presence of the specified column in the CSV file, enhancing data completeness. This is only relevant if csv.header is true.
  • Validates that each row contains the correct number of columns, aligning with the schema's defined structure, to prevent data misalignment.
  • The strict_column_order rule checks for the correct sequential order of columns as defined in the schema, ensuring structural consistency.
  • The allow_extra_columns rule asserts no additional columns are present in the CSV file beyond those specified in the schema, maintaining strict data fidelity.
    • For csv.header: true, it checks if the schema contains any column name not found in the CSV file, addressing header discrepancies.
    • For csv.header: false, it compares the number of columns in the schema against those in the CSV file, ensuring schema conformity.

Presets and reusable schemas

Presets significantly enhance the efficiency and reusability of schema definitions in CSV file validation by ensuring consistency across various files with common validation rules for fields like user IDs and email addresses.

This uniformity maintains data integrity and simplifies maintenance by allowing centralized updates that automatically apply to all linked schemas. Moreover, presets support customization through field-specific rule overrides, facilitating both standardization and specific needs adaptation.

Focus on your task rather than copy&pasting.

Example with presets

Let's look at a real life example. Suppose you have a "library" of different user profile validation rules that can be used in a wide variety of CSV files.

In order not to care about integrity and not to suffer from copy and paste, you can reuse ANY(!) existing schema. In fact, this can be considered as partial inheritance.

Important notes

  • You can make the chain of inheritance infinitely long. I.e. make chains of the form grant-parent.yml -> parent.yml -> child.yml -> grandchild.yml -> etc. Of course if you like to take risks ;).
  • But be careful with circular dependencies. The tool will not be able to handle them, and it can be an infinite loop.
  • Any(!) of the schema files can be used alone or as a library. The syntax is the same.
  • Schemas with presets validate themselves and if there are any obvious issues, you will see them when you try to use the schema. But logical conflicts between rules are not checked (It's almost impossible from a code perspective). As mentioned above, rules work in isolation and are not aware of each other. So the set of rules is your responsibility as always.
  • Alias in presets must match the regex pattern /^[a-z0-9-_]+$/i. Otherwise, it might break the syntax.

If something went wrong

If you're having trouble working with presets and don't understand how the CSV Blueprint under the hood understands it, just add --dump-schema to see it. Also, there is a separate CLI command to dump schema:

./csv-blueprint debug-schema -s ./your/schema.yml

Let's take a look at what this looks like in code.

  • Define a couple of basic rules for database columns.
  • Also, one of the files will contain rules specific only to the users profile.
  • And of course, let's make a schema that will simultaneously reuse the rules from these two files.

As a result, you don't just get a bunch of schemas for validation, which is difficult to manage, but something like a framework(!) that will be targeted to the specifics of your project, especially when there are dozens or even hundreds of CSV files and rules. It will be much easier to achieve consistency. Very often it's quite important.

Database preset

CLICK to see source code
name: Presets for database columns
description: This schema contains basic rules for database user data.

columns:
  - name: id
    description: Unique identifier, usually used to denote a primary key in databases.
    example: 12345
    rules:
      not_empty: true
      is_trimmed: true
      is_int: true
      num_min: 1
    aggregate_rules:
      is_unique: true
      sorted: [ asc, numeric ]

  - name: status
    description: Status in database
    example: active
    rules:
      not_empty: true
      allow_values: [ active, inactive, pending, deleted ]

User data preset

CLICK to see source code
name: Common presets for user data
description: >
  This schema contains common presets for user data.
  It can be used as a base for other schemas.

filename_pattern: /users-.*\.csv$/i

csv:
  delimiter: ';'

columns:
  - name: login
    description: User's login name
    example: johndoe
    rules:
      not_empty: true
      is_trimmed: true
      is_lowercase: true
      is_slug: true
      length_min: 3
      length_max: 20
      is_alnum: true
    aggregate_rules:
      is_unique: true

  - name: password
    description: User's password
    example: '9RfzENKD'
    rules:
      not_empty: true
      is_trimmed: true
      is_password_safe_chars: true
      password_strength_min: 7
      contains_none: [ "password", "123456", "qwerty", " " ]
      charset: UTF-8
      length_min: 6
      length_max: 20

  - name: full_name
    description: User's full name
    example: 'John Doe Smith'
    rules:
      not_empty: true
      is_trimmed: true
      charset: UTF-8
      contains: " "
      word_count_min: 2
      word_count_max: 8
      is_capitalize: true
    aggregate_rules:
      is_unique: true

  - name: email
    description: User's email address
    example: user@example.com
    rules:
      not_empty: true
      is_trimmed: true
      is_email: true
      is_lowercase: true
    aggregate_rules:
      is_unique: true

  - name: birthday
    description: Validates the user's birthday.
    example: '1990-01-01'
    rules:
      not_empty: true            # The birthday field must not be empty.
      is_trimmed: true           # Trims the value before validation.
      date_format: Y-m-d         # Checks if the date matches the YYYY-MM-DD format.
      is_date: true              # Validates if the value is a valid date.
      date_age_greater: 0        # Ensures the date is in the past.
      date_age_less: 150         # Ensures the user is not older than 150 years.
      date_max: now              # Ensures the date is not in the future.

  - name: phone_number
    description: User's phone number in US
    example: '+1 650 253 00 00'
    rules:
      not_empty: true
      is_trimmed: true
      starts_with: '+1'
      phone: US

  - name: balance
    description: User's balance in USD
    example: '100.00'
    rules:
      not_empty: true
      is_trimmed: true
      is_float: true
      num_min: 0.00
      num_max: 1000000000.00      # 1 billion is max amount in our system.
      precision: 2

  - name: short_description
    description: A brief description of the item
    example: 'Lorem ipsum dolor sit amet'
    rules:
      not_empty: true
      contains: " "
      length_max: 255
      is_trimmed: true

Usage of presets This short and clear Yaml under the hood as roughly as follows. As you can see it simplifies your work a lot.

name: Schema uses presets and add new columns + specific rules.
description: This schema uses presets. Also, it demonstrates how to override preset values.

presets: # Include any other schemas and defined for each alias.
  users: ./preset_users.yml       # Include the schema with common user data.
  db: ./preset_database.yml       # Include the schema with basic database columns.

csv:
  preset: users                   # Take the CSV settings from the preset.
  enclosure: '|'                  # Overridden enclosure only for this schema.

columns:
  # Grap only needed columns from the preset in specific order.
  - preset: db/id
  - preset: db/status
  - preset: users/login
  - preset: users/email
  - preset: users/full_name
  - preset: users/birthday
  - preset: users/phone_number    # Rename the column. "phone_number" => "phone".
    name: phone
  - preset: users/password        # Overridden value to force a strong password.
    rules: { length_min: 10 }
  - name: admin_note              # New column specific only this schema.
    description: Admin note
    rules:
      not_empty: true
      length_min: 1
      length_max: 10
    aggregate_rules:              # In practice this will be a rare case, but the opportunity is there.
      preset: db/id               # Take only aggregate rules from the preset.
      is_unique: true             # Added new specific aggregate rule.
CLICK to see what it looks like in memory.
# Schema file is "./schema-examples/preset_usage.yml"
name: 'Schema uses presets and add new columns + specific rules.'
description: 'This schema uses presets. Also, it demonstrates how to override preset values.'
presets:
  users: ./schema-examples/preset_users.yml
  db: ./schema-examples/preset_database.yml
csv:
  delimiter: ;
  enclosure: '|'
columns:
  - name: id
    description: 'Unique identifier, usually used to denote a primary key in databases.'
    example: 12345
    rules:
      not_empty: true
      is_trimmed: true
      is_int: true
      num_min: 1
    aggregate_rules:
      is_unique: true
      sorted:
        - asc
        - numeric

  - name: status
    description: 'Status in database'
    example: active
    rules:
      not_empty: true
      allow_values:
        - active
        - inactive
        - pending
        - deleted

  - name: login
    description: "User's login name"
    example: johndoe
    rules:
      not_empty: true
      length_min: 3
      length_max: 20
      is_trimmed: true
      is_lowercase: true
      is_slug: true
      is_alnum: true
    aggregate_rules:
      is_unique: true

  - name: email
    description: "User's email address"
    example: user@example.com
    rules:
      not_empty: true
      is_trimmed: true
      is_lowercase: true
      is_email: true
    aggregate_rules:
      is_unique: true

  - name: full_name
    description: "User's full name"
    example: 'John Doe Smith'
    rules:
      not_empty: true
      is_trimmed: true
      is_capitalize: true
      word_count_min: 2
      word_count_max: 8
      contains: ' '
      charset: UTF-8
    aggregate_rules:
      is_unique: true

  - name: birthday
    description: "Validates the user's birthday."
    example: '1990-01-01'
    rules:
      not_empty: true
      is_trimmed: true
      is_date: true
      date_max: now
      date_format: Y-m-d
      date_age_greater: 0
      date_age_less: 150

  - name: phone
    description: "User's phone number in US"
    example: '+1 650 253 00 00'
    rules:
      not_empty: true
      is_trimmed: true
      starts_with: '+1'
      phone: US

  - name: password
    description: "User's password"
    example: 9RfzENKD
    rules:
      not_empty: true
      length_min: 10
      length_max: 20
      is_trimmed: true
      contains_none:
        - password
        - '123456'
        - qwerty
        - ' '
      password_strength_min: 7
      is_password_safe_chars: true
      charset: UTF-8

  - name: admin_note
    description: 'Admin note'
    rules:
      not_empty: true
      length_min: 1
      length_max: 10
    aggregate_rules:
      is_unique: true
      sorted:
        - asc
        - numeric

As a result, readability and maintainability became dramatically easier. You can easily add new rules, change existing, etc.

Complete example with all available syntax

CLICK to see available syntax.
name: Complite list of preset features
description: This schema contains all the features of the presets.

presets:
  # The basepath for the preset is `.` (current directory of the current schema file).
  # Define alias "db" for schema in `./preset_database.yml`.
  db: preset_database.yml           # Or `db: ./preset_database.yml`. It's up to you.

  # For example, you can use a relative path.
  users: ./../schema-examples/preset_users.yml

  # Or you can use an absolute path.
  # db: /full/path/preset_database.yml

filename_pattern: { preset: users } # Take the filename pattern from the preset.
structural_rules: { preset: users } # Take the global rules from the preset.
csv: { preset: users }              # Take the CSV settings from the preset.

columns:
  # Use name of column from the preset.
  # "db" is alias. "id" is column `name` in `preset_database.yml`.
  - preset: 'db/id'

  # Use column index. "db" is alias. "0" is column index in `preset_database.yml`.
  - preset: 'db/0'
  - preset: 'db/0:'

  # Use column index and column name. It useful if column name is not unique.
  - preset: 'db/0:id'

  # Use only `rules` of "status" column from the preset.
  - name: My column
    rules:
      preset: 'db/status'

  # Override only `aggregate_rules` from the preset.
  # Use only `aggregate_rules` of "id" column from the preset.
  # We strictly take only the very first column (index = 0).
  - name: My column
    aggregate_rules:
      preset: 'db/0:id'

  # Combo!!! If you're a risk-taker or have a high level of inner zen. :)
  # Creating a column from three other columns.
  # In fact, it will merge all three at once with key replacement.
  - name: Crazy combo!
    description: >                  # Just a great advice.
      I like to take risks, too.
      Be careful. Use your power wisely.
    example: ~                      # Ignore inherited "example" value. Set it `null`.
    preset: 'users/login'
    rules:
      preset: 'users/email'
      not_empty: true               # Disable the rule from the preset.
    aggregate_rules:
      preset: 'db/0'

Note: All provided YAML examples pass built-in validation, yet they may not make practical sense. These are intended solely for demonstration and to illustrate potential configurations and features.

Parallel processing

The --parallel option is available for speeding up the validation of CSV files by utilizing more CPU resources effectively.

Key Points

  • Experimental Feature: This feature is currently experimental and requires further debugging and testing. Although it performs well in synthetic autotests and benchmarks. More practical use cases are needed to validate its stability.
  • Use Case: This option is beneficial if you are processing dozens of CSV files, with each file taking 1 second or more to process.
  • Default Behavior: If you use --parallel without specifying a value, it defaults to using the maximum number of available CPU cores.
  • Thread Pool Size: You can set a specific number of threads for the pool. For example, --parallel=10 will set the thread pool size to 10. It doesn't make much sense to specify more than the number of logical cores in your CPU. Otherwise, it will only slow things down a bit due to the system overhead to handle multithreading.
  • Disabling Parallelism: Using --parallel=1 disables parallel processing, which is the default setting if the option is not specified.
  • Implementation: The feature relies on the ext-parallel PHP extension, which enables the creation of lightweight threads rather than processes. This extension is already included in our Docker image. Ensure that you have the ext-parallel extension installed if you are not using our Docker image. This extension is crucial for the operation of the parallel processing feature. The application always runs in single-threaded mode if the extension is not installed.

Complete CLI help message

This section outlines all available options and commands provided by the tool, leveraging the JBZoo/Cli package for its CLI. The tool offers a comprehensive set of options to cater to various needs and scenarios, ensuring flexibility and efficiency in CSV file validation and manipulation.

For detailed information on each command and option, refer to the JBZoo/Cli documentation. This resource provides insights into the functionality and application of the CLI commands, helping users make the most out of the tool's capabilities.

./csv-blueprint validate-csv --help

CLICK to see validate-csv help messege
Description:
  Validate CSV file(s) by schema(s).

Usage:
  validate-csv [options]
  validate:csv

Options:
  -c, --csv=CSV                    Specify the path(s) to the CSV files you want to validate.
                                   This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
                                   Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
                                    (multiple values allowed)
  -s, --schema=SCHEMA              Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
                                   Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
                                   Examples: p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
                                    (multiple values allowed)
  -S, --skip-schema[=SKIP-SCHEMA]  Skips schema validation for quicker checks when the schema's correctness is certain.
                                   Use any non-empty value or "yes" to activate
                                    [default: "no"]
  -a, --apply-all[=APPLY-ALL]      Apply all schemas (also without `filename_pattern`) to all CSV files found as global rules.
                                   Available options:
                                   - auto: If no glob pattern (*) is used for --schema, the schema is applied to all found CSV files.
                                   - yes|y|1: Apply all schemas to all CSV files, Schemas without `filename_pattern` are applied as a global rule.
                                   - no|n|0: Apply only schemas with not empty `filename_pattern` and match the CSV files.
                                   Note. If specify the option `--apply-all` without value, it will be treated as "yes".
                                    [default: "auto"]
  -Q, --quick[=QUICK]              Stops the validation process upon encountering the first error,
                                   accelerating the check but limiting error visibility.
                                   Returns a non-zero exit code if any error is detected.
                                   Enable by setting to any non-empty value or "yes".
                                    [default: "no"]
  -r, --report=REPORT              Determines the report's output format.
                                   Available options: text, table, github, gitlab, teamcity, junit
                                    [default: "table"]
      --dump-schema                Dumps the schema of the CSV file if you want to see the final schema after inheritance.
      --debug                      Intended solely for debugging and advanced profiling purposes.
                                   Activating this option provides detailed process insights,
                                   useful for troubleshooting and performance analysis.
      --parallel[=PARALLEL]        EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
                                   You can specify the number of threads.
                                   If you do not specify a value, the number of threads will be equal to the number of CPU cores.
                                   By default, the process is launched in a single-threaded mode. [default: "1"]
      --no-progress                Disable progress bar animation for logs. It will be used only for text output format.
      --mute-errors                Mute any sort of errors. So exit code will be always "0" (if it's possible).
                                   It has major priority then --non-zero-on-error. It's on your own risk!
      --stdout-only                For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
      --non-zero-on-error          None-zero exit code on any StdErr message.
      --timestamp                  Show timestamp at the beginning of each message.It will be used only for text output format.
      --profile                    Display timing and memory usage information.
      --output-mode=OUTPUT-MODE    Output format. Available options:
                                   text - Default text output format, userfriendly and easy to read.
                                   cron - Shortcut for crontab. It's basically focused on human-readable logs output.
                                   It's combination of --timestamp --profile --stdout-only --no-progress -vv.
                                   logstash - Logstash output format, for integration with ELK stack.
                                    [default: "text"]
      --cron                       Alias for --output-mode=cron. Deprecated!
  -h, --help                       Display help for the given command. When no command is given display help for the list command
  -q, --quiet                      Do not output any message
  -V, --version                    Display this application version
      --ansi|--no-ansi             Force (or disable --no-ansi) ANSI output
  -n, --no-interaction             Do not ask any interactive question
  -v|vv|vvv, --verbose             Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

./csv-blueprint validate-schema --help

CLICK to see validate-schema help messege
Description:
  Validate syntax in schema file(s).

Usage:
  validate-schema [options]

Options:
  -s, --schema=SCHEMA            Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
                                 Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
                                 Examples: /full/path/name.yml; p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
                                  (multiple values allowed)
  -Q, --quick[=QUICK]            Stops the validation process upon encountering the first error,
                                 accelerating the check but limiting error visibility.
                                 Returns a non-zero exit code if any error is detected.
                                 Enable by setting to any non-empty value or "yes".
                                  [default: "no"]
  -r, --report=REPORT            Determines the report's output format.
                                 Available options: text, table, github, gitlab, teamcity, junit
                                  [default: "table"]
      --dump-schema              Dumps the schema of the CSV file if you want to see the final schema after inheritance.
      --debug                    Intended solely for debugging and advanced profiling purposes.
                                 Activating this option provides detailed process insights,
                                 useful for troubleshooting and performance analysis.
      --parallel[=PARALLEL]      EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
                                 You can specify the number of threads.
                                 If you do not specify a value, the number of threads will be equal to the number of CPU cores.
                                 By default, the process is launched in a single-threaded mode. [default: "1"]
      --no-progress              Disable progress bar animation for logs. It will be used only for text output format.
      --mute-errors              Mute any sort of errors. So exit code will be always "0" (if it's possible).
                                 It has major priority then --non-zero-on-error. It's on your own risk!
      --stdout-only              For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
      --non-zero-on-error        None-zero exit code on any StdErr message.
      --timestamp                Show timestamp at the beginning of each message.It will be used only for text output format.
      --profile                  Display timing and memory usage information.
      --output-mode=OUTPUT-MODE  Output format. Available options:
                                 text - Default text output format, userfriendly and easy to read.
                                 cron - Shortcut for crontab. It's basically focused on human-readable logs output.
                                 It's combination of --timestamp --profile --stdout-only --no-progress -vv.
                                 logstash - Logstash output format, for integration with ELK stack.
                                  [default: "text"]
      --cron                     Alias for --output-mode=cron. Deprecated!
  -h, --help                     Display help for the given command. When no command is given display help for the list command
  -q, --quiet                    Do not output any message
  -V, --version                  Display this application version
      --ansi|--no-ansi           Force (or disable --no-ansi) ANSI output
  -n, --no-interaction           Do not ask any interactive question
  -v|vv|vvv, --verbose           Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

./csv-blueprint dump-schema --help

CLICK to see debug-schema help messege
Description:
  Show the internal representation of the schema taking into account presets.

Usage:
  debug-schema [options]

Options:
  -s, --schema=SCHEMA            Specify the path to a schema file, supporting YAML, JSON, or PHP formats.
                                 Examples: /full/path/name.yml; p/file.yml
  -d, --hide-defaults            Hide default values in the output.
      --no-progress              Disable progress bar animation for logs. It will be used only for text output format.
      --mute-errors              Mute any sort of errors. So exit code will be always "0" (if it's possible).
                                 It has major priority then --non-zero-on-error. It's on your own risk!
      --stdout-only              For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
      --non-zero-on-error        None-zero exit code on any StdErr message.
      --timestamp                Show timestamp at the beginning of each message.It will be used only for text output format.
      --profile                  Display timing and memory usage information.
      --output-mode=OUTPUT-MODE  Output format. Available options:
                                 text - Default text output format, userfriendly and easy to read.
                                 cron - Shortcut for crontab. It's basically focused on human-readable logs output.
                                 It's combination of --timestamp --profile --stdout-only --no-progress -vv.
                                 logstash - Logstash output format, for integration with ELK stack.
                                  [default: "text"]
      --cron                     Alias for --output-mode=cron. Deprecated!
  -h, --help                     Display help for the given command. When no command is given display help for the list command
  -q, --quiet                    Do not output any message
  -V, --version                  Display this application version
      --ansi|--no-ansi           Force (or disable --no-ansi) ANSI output
  -n, --no-interaction           Do not ask any interactive question
  -v|vv|vvv, --verbose           Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

./csv-blueprint create-schema --help It's beta. Work in progress.

CLICK to see create-schema help messege
Description:
  Analyze CSV files and suggest a schema based on the data found.

Usage:
  create-schema [options]

Options:
  -c, --csv=CSV                      Specify the path(s) to the CSV files you want to analyze.
                                     This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
                                     Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
                                      (multiple values allowed)
  -H, --header[=HEADER]              Force the presence of a header row in the CSV files. [default: "yes"]
  -L, --lines[=LINES]                The number of lines to read when detecting parameters. Minimum is 1. [default: 10000]
  -C, --check-syntax[=CHECK-SYNTAX]  Check the syntax of the suggested schema. [default: "yes"]
  -r, --report=REPORT                Determines the report's output format.
                                     Available options: text, table, github, gitlab, teamcity, junit
                                      [default: "table"]
      --dump-schema                  Dumps the schema of the CSV file if you want to see the final schema after inheritance.
      --debug                        Intended solely for debugging and advanced profiling purposes.
                                     Activating this option provides detailed process insights,
                                     useful for troubleshooting and performance analysis.
      --parallel[=PARALLEL]          EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
                                     You can specify the number of threads.
                                     If you do not specify a value, the number of threads will be equal to the number of CPU cores.
                                     By default, the process is launched in a single-threaded mode. [default: "1"]
      --no-progress                  Disable progress bar animation for logs. It will be used only for text output format.
      --mute-errors                  Mute any sort of errors. So exit code will be always "0" (if it's possible).
                                     It has major priority then --non-zero-on-error. It's on your own risk!
      --stdout-only                  For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
      --non-zero-on-error            None-zero exit code on any StdErr message.
      --timestamp                    Show timestamp at the beginning of each message.It will be used only for text output format.
      --profile                      Display timing and memory usage information.
      --output-mode=OUTPUT-MODE      Output format. Available options:
                                     text - Default text output format, userfriendly and easy to read.
                                     cron - Shortcut for crontab. It's basically focused on human-readable logs output.
                                     It's combination of --timestamp --profile --stdout-only --no-progress -vv.
                                     logstash - Logstash output format, for integration with ELK stack.
                                      [default: "text"]
      --cron                         Alias for --output-mode=cron. Deprecated!
  -h, --help                         Display help for the given command. When no command is given display help for the list command
  -q, --quiet                        Do not output any message
  -V, --version                      Display this application version
      --ansi|--no-ansi               Force (or disable --no-ansi) ANSI output
  -n, --no-interaction               Do not ask any interactive question
  -v|vv|vvv, --verbose               Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug

Report examples

The validation process culminates in a human-readable report detailing any errors identified within the CSV file. While the default report format is a table, the tool supports various output formats, including text, GitHub, GitLab, TeamCity, JUnit, among others, to best suit your project's needs and your personal or team preferences.

GitHub Action format

To see user-friendly error outputs in your pull requests (PRs), specify report: github. This utilizes annotations to highlight bugs directly within the GitHub interface at the PR level. This feature allows errors to be displayed in the exact location within the CSV file, right in the diff of your Pull Requests. For a practical example, view this live demo PR.

GitHub Actions - PR

CLICK to see example in GitHub Actions terminal

GitHub Actions - Terminal

Text format

Optional format text with highlited keywords for qucik navigation.

Report - Text

Table format

When using the table format (default), the output is organized in a clear, easily interpretable table that lists all discovered errors. This format is ideal for quick reviews and sharing with team members for further action.

Table format

Notes

  • Report format for GitHub Actions is table by default.
  • Tools uses JBZoo/CI-Report-Converter as SDK to convert reports to different formats. So you can easily integrate it with any CI system.

Benchmarks

Understanding the performance of this tool is crucial, but it's important to note that its efficiency is influenced by several key factors:

  • File Size: The dimensions of the CSV file, both in terms of rows and columns, directly impact processing time. Performance scales linearly with file size and is dependent on the capabilities of your hardware, such as CPU and SSD speed.
  • Number of Rules: More validation rules per column mean more iterations for processing. Each rule operates independently, so the total time and memory consumption are cumulative across all rules.
  • Rule Intensity: While most validation rules are optimized for speed and low memory usage, some, like interquartile_mean, can be significantly slower. For instance, interquartile_mean might process around 4,000 lines per second, whereas other rules can handle upwards of 50 million lines per second.

However, to gain a general understanding of performance, refer to the table below.

  • All tests were conducted on a dataset comprising 2 million lines plus an additional line for the header.
  • These results are derived from the most current version, as verified by tests run using GitHub Actions (See workflow.yml). The link provides access to a variety of builds, which are essential for different testing scenarios and experiments. The most representative data can be found under Docker (latest, XX).
  • Developer mode was activated for these tests, using the flags -vvv --debug --profile.
  • Testing environment included the latest Ubuntu + Docker. For more information about the GitHub Actions (GA) hardware used, please see details about GA hardware.
  • The primary performance indicator is the processing speed, measured in lines per second. Note that speeds are presented in thousands of lines per second (100K equals 100,000 lines per second).
  • Peak RAM usage throughout the duration of each test case serves as an additional performance metric.

Profiles:

  • Quickest: Focuses on the fastest rules, either cell or aggregation, providing a baseline.
  • Minimum: Uses a set of normal performance rules, with two instances of each.
  • Realistic: Represents a mix of rules likely encountered in typical use cases.
  • All Aggregations: Tests all aggregation rules simultaneously, illustrating maximum load.

Divisions:

  • Cell Rules: Tests only individual cell validation rules.
  • Agg Rules: Focuses solely on column-wide aggregation rules.
  • Cell + Agg: Combines cell and aggregation rules for comprehensive validation.
  • Peak Memory: Indicates the maximum RAM usage, particularly relevant in scenarios with aggregation rules.

Note: The Peak Memory metric is primarily of interest when aggregation rules are used, as non-aggregation scenarios typically require no more than 2-4 megabytes of memory, regardless of file size or rule count.

These benchmarks offer a snapshot of the tool's capabilities across a range of scenarios, helping you gauge its suitability for your specific CSV validation needs.

File / Profile
Metric
Quickest Minimum Realistic All aggregations
Columns: 1
Size: ~8 MB


Cell rules
Agg rules
Cell + Agg
Peak Memory
786K,  2.5 sec
1187K,  1.7 sec
762K,  2.6 sec
52 MB
386K,  5.2 sec
1096K,  1.8 sec
373K,  5.4 sec
68 MB
189K, 10.6 sec
667K,  3.0 sec
167K, 12.0 sec
208 MB
184K, 10.9 sec
96K, 20.8 sec
63K, 31.7 sec
272 MB
Columns: 5
Size: 64 MB


Cell rules
Agg rules
Cell + Agg
Peak Memory
545K,  3.7 sec
714K,  2.8 sec
538K,  3.7 sec
52 MB
319K,  6.3 sec
675K,  3.0 sec
308K,  6.5 sec
68 MB
174K, 11.5 sec
486K,  4.1 sec
154K, 13.0 sec
208 MB
168K, 11.9 sec
96K, 20.8 sec
61K, 32.8 sec
272 MB
Columns: 10
Size: 220 MB


Cell rules
Agg rules
Cell + Agg
Peak Memory
311K,  6.4 sec
362K,  5.5 sec
307K,  6.5 sec
52 MB
221K,  9.0 sec
354K,  5.6 sec
215K,  9.3 sec
68 MB
137K, 14.6 sec
294K,  6.8 sec
125K, 16.0 sec
208 MB
135K, 14.8 sec
96K, 20.8 sec
56K, 35.7 sec
272 MB
Columns: 20
Size: 1.2 GB


Cell rules
Agg rules
Cell + Agg
Peak Memory
103K, 19.4 sec
108K, 18.5 sec
102K, 19.6 sec
52 MB
91K, 22.0 sec
107K, 18.7 sec
89K, 22.5 sec
68 MB
72K, 27.8 sec
101K, 19.8 sec
69K, 29.0 sec
208 MB
71K, 28.2 sec
96K, 20.8 sec
41K, 48.8 sec
272 MB

Additional Benchmark Insights:

When running the same validation tests on different hardware configurations, the performance of the tool can vary significantly. Notably, testing on a MacBook 14" M2 Max (2023) yields results that are approximately twice as fast as those observed on the GitHub Actions hardware. This indicates the tool's exceptional performance on modern, high-spec devices.

Conversely, tests conducted on a MacBook Pro (2019) with an Intel 2.4 GHz processor align closely with the GitHub Actions results, suggesting that the benchmark table provided reflects an average performance level for typical engineering hardware.

Brief conclusions

  • Cell Rules: These rules are highly CPU-intensive but require minimal RAM, typically around 1-2 MB at peak. The more cell rules applied to a column, the longer the validation process takes due to the additional actions performed on each value.

  • Aggregation Rules: These rules operate at incredible speeds, processing anywhere from 10 million to billions of rows per second. However, they are significantly more RAM-intensive. Interestingly, adding over 100 different aggregation rules does not substantially increase memory consumption.

  • PHP Array Functions: Not all PHP array functions can operate by reference (&$var). Whether or not a dataset in a column can be manipulated in this way is highly dependent on the specific algorithm used. For example, a 20 MB dataset might be duplicated during processing, leading to a peak memory usage of 40 MB. Consequently, optimization techniques that rely on passing data by reference are often ineffective.

  • Practical Implications: If processing a 1 GB file within 30-60 seconds is acceptable, and if there is 200-500 MB of RAM available, there may be little need to overly concern oneself with these performance considerations.

  • Memory Management: Throughout testing, no memory leaks were observed.

Examples of CSV Files

The CSV files utilized for benchmark testing are described below. These files were initially generated using PHP Faker to create the first 2000 lines. Subsequently, they were replicated 1000 times within themselves, allowing for the creation of significantly large random files in a matter of seconds.

A key principle observed in these files is that as the number of columns increases, the length of the values within these columns also tends to increase, following a pattern akin to exponential growth.

Columns: 1, Size: 8.48 MB
id
1
2
Columns: 5, Size: 64 MB
id,bool_int,bool_str,number,float
1,0,false,289566,864360.14285714
2,1,true,366276,444761.71428571
Columns: 10, Size: 220 MB
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4
1,1,true,779914,1101964.2857143,2011-02-04,"2000-03-02 00:33:57",erdman.net,germaine.brakus@yahoo.com,32.51.181.238
2,0,true,405408,695839.42857143,1971-01-29,"1988-08-12 21:25:27",bode.com,tatyana.cremin@yahoo.com,76.79.155.73
Columns: 20, Size: 1.2 GB
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4,uuid,address,postcode,latitude,longitude,ip6,sentence_tiny,sentence_small,sentence_medium,sentence_huge
1,1,false,884798,1078489.5714286,2006-02-09,"2015-12-07 22:59:06",gerhold.com,alisa93@barrows.com,173.231.203.134,5a2b6f01-0bac-35b2-bef1-5be7bb3c2d78,"776 Moises Coves Apt. 531; Port Rylan, DC 80810",10794,-69.908375,136.780034,78cb:75d9:4dd:8248:f190:9f3c:b0e:9afc,"Ea iusto non.","Qui sapiente qui ut nihil sit.","Modi et voluptate blanditiis aliquid iure eveniet voluptas facilis ipsum omnis velit.","Minima in molestiae nam ullam voluptatem sapiente corporis sunt in ut aut alias exercitationem incidunt fugiat doloribus laudantium ducimus iusto nemo assumenda non ratione neque labore voluptatem."
2,0,false,267823,408705.14285714,1985-07-19,"1996-11-18 08:21:44",keebler.net,wwolff@connelly.com,73.197.210.145,29e076ab-a769-3a1f-abd4-2bc73ab17c99,"909 Sabryna Island Apt. 815; West Matteoside, CO 54360-7141",80948,7.908256,123.666864,bf3b:abab:3dcb:c335:b1a:b5d6:60e9:107e,"Aut dolor distinctio quasi.","Alias sit ut perferendis quod at dolores.","Molestiae est eos dolore deserunt atque temporibus.","Quisquam velit aut saepe temporibus officia labore quam numquam eveniet velit aliquid aut autem quis voluptatem in ut iste sunt omnis iure laudantium aspernatur tenetur nemo consequatur aliquid sint nostrum aut nostrum."

Run benchmark locally

Make sure you have PHP 8.2+ and Docker installed.

# Clone the latest version
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint

# download dependencies and build the tool.
make build              # We need it to build benchmark tool. See `./tests/Benchmarks` folder.
make build-phar-file    # Optional. Only if you want to test it.
make docker-build       # Recommended. local tag is "jbzoo/csv-blueprint:local"

# Create random CSV files with 5 columns (max: 20).
BENCH_COLS=5 make bench-create-csv

# Run the benchmark for the recent CSV file.
BENCH_COLS=5 make bench-docker # Recommended
BENCH_COLS=5 make bench-phar
BENCH_COLS=5 make bench-php

# It's a shortcut that combines CSV file creation and Docker run.
# By default BENCH_COLS=10
make bench

Disadvantages?

The perception that PHP is inherently slow is a common misconception. However, with the right optimization strategies, PHP can perform exceptionally well. For evidence, refer to the article Processing One Billion CSV Rows in PHP!, which demonstrates that PHP can process, aggregate, and calculate data from CSV files at approximately 15 million lines per second! While not all optimizations are currently implemented, the performance is already quite impressive.

  • Yes, it's acknowledged that this tool might not be the fastest available, but it's also far from the slowest. For more details, see the link provided above.
  • Yes, the tool is built with PHP—not Python, Go, or PySpark—which may not be the first choice for such tasks.
  • Yes, it functions like a standalone binary. The recommendation is simply to use it without overthinking its internal workings.
  • Yes, it's recognized that this cannot be used as a Python SDK within a pipeline.

However, for the majority of scenarios, these are not deal-breakers. The utility effectively addresses the challenge of validating CSV files in continuous integration (CI) environments. 👍

This utility is designed for immediate use without necessitating a deep understanding of its inner mechanics. It adheres to rigorous testing standards, including strict typing, approximately seven linters and static analyzers at the highest rule level. Furthermore, every pull request is subjected to around ten different checks on GitHub Actions, spanning a matrix of PHP versions and modes, ensuring robustness. The extensive coverage and precautions are due to the unpredictability of usage conditions, embodying the spirit of the Open Source community.

In summary, the tool is developed with the highest standards of modern PHP practices, ensuring it performs as expected.

Coming soon

It's random ideas and plans. No promises and deadlines. Feel free to help me!.

CLICK to see the roadmap
  • Batch processing

    • If option --csv is not specified, then the STDIN is used. To build a pipeline in Unix-like systems.
  • Validation

    • Multi filename_pattern. Support list of regexs.
    • Multi values in one cell.
    • Custom cell rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
    • Custom agregate rule as a callback. It's useful when you have a complex rule that can't be described in the schema file.
    • Configurable keyword for null/empty values. By default, it's an empty string. But you will use null, nil, none, empty, etc. Overridable on the column level.
    • Handle empty files and files with only a header row, or only with one line of data. One column wthout header is also possible.
    • If option --schema is not specified, then validate only super base level things (like "is it a CSV file?").
    • Complex rules (like "if field A is not empty, then field B should be not empty too").
    • Extending with custom rules and custom report formats. Plugins?
    • Input encoding detection + BOM (right now it's experimental). It works but not so accurate... UTF-8 is the best choice for now.
  • Performance and optimization

    • Using vectors instead of arrays to optimaze memory usage and speed of access.
    • Multithreading support for parallel validation of CSV by columns.
  • Mock data generation

    • Create CSV files based on the schema (like "create 1000 rows with random data based on schema and rules").
    • Use Faker for random data generation.
    • ReverseRegex to generate text from regex.
  • Reporting

    • More report formats (like JSON, XML, etc). Any ideas?
    • Gitlab and JUnit reports must be as one structure. It's not so easy to implement. But it's a good idea.
    • Merge reports from multiple CSV files into one report. It's useful when you have a lot of files and you want to see all errors in one place. Especially for GitLab and JUnit reports.
  • Misc

    • Install via brew on MacOS.
    • Install via apt on Ubuntu.
    • Use it as PHP SDK. Examples in Readme.
    • Warnings about deprecated options and features.
    • Add option --recomendation to show a list of recommended rules for the schema or potential issues in the CSV file or schema. It's useful when you are not sure what rules to use.
    • Add option --error=[level] to show only errors with a specific level. It's useful when you have a lot of warnings and you want to see only errors.
    • More examples and documentation.

PS. There is a file with my ideas and imagination. It's not valid schema file, just a draft. I'm not sure if I will implement all of them. But I will try to do my best.

Contributing

If you have any ideas or suggestions, feel free to open an issue or create a pull request.

# Fork the repo and build project
git clone git@github.com:jbzoo/csv-blueprint.git ./jbzoo-csv-blueprint
cd ./jbzoo-csv-blueprint
make build

# Make your local changes

# Autofix code style 
make test-phpcsfixer-fix test-phpcs

# Run all tests and check code style
make test
make codestyle

# Create your pull request and check all tests in CI (Github Actions)
# ???
# Profit!

License

MIT License: It's like free pizza - enjoy it, share it, just don't sell it as your own. And remember, no warranty for stomach aches! 😅

See also

  • Cli - Framework helps create complex CLI apps and provides new tools for Symfony/Console.
  • CI-Report-Converter - It converts different error reporting standards for popular CI systems.
  • Composer-Diff - See what packages have changed after composer update.
  • Composer-Graph - Dependency graph visualization of composer.json based on Mermaid JS.
  • Mermaid-PHP - Generate diagrams and flowcharts with the help of the mermaid script language.
  • Utils - Collection of useful PHP functions, mini-classes, and snippets for every day.
  • Image - Package provides object-oriented way to manipulate with images as simple as possible.
  • Data - Extended implementation of ArrayObject. Use Yml/PHP/JSON/INI files as config. Forget about arrays.
  • Retry - Tiny PHP library providing retry/backoff functionality with strategies and jitter.
just interesting fact

I've achieved a personal milestone. The initial release of the project was crafted from the ground up in approximately 3 days, interspersed with regular breaks to care for a 4-month-old baby. Reflecting on the first commit and the earliest git tag, it's clear that this was accomplished over a weekend, utilizing spare moments on my personal laptop.