Set of static assets used (mainly) for ARCHE data preprocessing

3.14.1 2024-03-11 10:00 UTC


Set of static assets used (mainly) for ARCHE data preprocessing or ARCHE information pages:

  • URI normalization rules used within the ACDH-CH.
    (stored in AcdhArcheAssets/uriNormRules.json)
  • Description of input data formats accepted by ARCHE.
    (stored in AcdhArcheAssets/formats.json)

The repository also provides Python 3 and PHP bindings for accessing those assets.

Installation & usage


Python

  • Install using pip3:
    pip3 install acdh-arche-assets
  • Use with
    from AcdhArcheAssets.uri_norm_rules import get_rules, get_normalized_uri, get_norm_id
    wrong_id = ""
    good_id = get_normalized_uri(wrong_id)
    # ""
    # extract ID from URL
    norm_id = get_norm_id("")
    # "1232324343"
    from AcdhArcheAssets.file_formats import get_formats, get_by_mtype, get_by_extension
    formats = get_formats()
    matching_mapping = get_by_mtype('image/png')
    matching_mapping = get_by_extension('png')


PHP

  • Install using composer:
    composer require acdh-oeaw/arche-assets
  • Use with
    require_once 'vendor/autoload.php';
    print_r(acdhOeaw\UriNormRules::getRules(['viaf', 'gnd']));

Description of assets

URI normalization rules

Each rule consists of five properties:

  • name: a rule name
  • match: a regular expression matching a given URI namespace
  • replace: a regular-expression replacement string that normalizes a URI in a given namespace
  • resolve: a regular-expression replacement string that transforms a URI in a given namespace into a URL from which RDF data can be fetched
  • format: an RDF serialization format to be requested when resolving the URL produced using the resolve field
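
The way the match, replace and resolve properties work together can be sketched with Python's standard re module. This is a minimal illustration, not the package's own implementation: the rule below, the example.org namespace and the apply_rule helper are all hypothetical, and it assumes that replacement strings use \1-style backreferences.

```python
import re

# Hypothetical rule for illustration only; the real rules are stored in
# AcdhArcheAssets/uriNormRules.json and may use a different syntax.
EXAMPLE_RULE = {
    "name": "example",
    "match": r"^https?://(?:www\.)?example\.org/id/([0-9]+)/?$",
    "replace": r"https://example.org/id/\1",
    "resolve": r"https://example.org/id/\1.rdf",
    "format": "application/rdf+xml",
}


def apply_rule(uri, rule):
    """Return (normalized URI, resolvable URL), or None if the rule does not match."""
    if re.match(rule["match"], uri) is None:
        return None
    # Both outputs are produced from the same match pattern; only the
    # replacement string differs.
    normalized = re.sub(rule["match"], rule["replace"], uri)
    resolvable = re.sub(rule["match"], rule["resolve"], uri)
    return normalized, resolvable


result = apply_rule("http://www.example.org/id/42/", EXAMPLE_RULE)
```

Here result holds the normalized form of the URI together with the URL that would be requested using the serialization named in format. A URI outside the namespace yields None, so a caller can try the next rule.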


Formats

A curated and growing list of file extensions. For each file extension, mappings to the respective ARCHE Resource Type Category (stored in acdh:hasCategory) and Media Type (MIME type) (stored in acdh:hasFormat) are given. The indicated Media Type should only be used as a fallback; it is best practice to rely on automated Media Type detection based on file signatures.

Further information is provided as well.

  • fileExtension: File extension to be mapped.
  • name: Name(s) the format is known by
  • archeCategory: The corresponding URI of the ARCHE Resource Type Category Vocabulary
  • dataType: A broad category to group formats in; mainly intended for visualisation purposes.
  • pronomID: ID(s) assigned by PRONOM
  • mimeType: Official Media Type(s) (formerly known as MIME types) registered at IANA.
  • informalMimeType: Other MIME types known for the format
  • magicNumber: A constant numerical or text value used to identify a file format (see e.g. Wikipedia's list of file signatures)
  • ianaTemplate: Link to template at IANA
  • reference: Link(s) to format specifications referenced by IANA and others
  • longTerm: Indicates if a format is suitable for long-term preservation.
    Possible values and their meaning:
    • yes - long-term format
    • no - not suitable, another format should be used
    • restricted - can be used for long-term preservation in some cases (see comment)
    • unsure - status remains to be evaluated
  • archeDocs: Link to a place with more information for the format.
  • comment: Any other noteworthy information not stated elsewhere.
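
As a rough sketch of how these fields might be consumed, the following looks up an entry by file extension and turns its longTerm value into a readable note. The sample entries, their values and both helper functions are hypothetical; they only mimic the field names listed above, and they assume that fileExtension, longTerm and comment are plain strings.

```python
# Hypothetical entries that mimic the fields described above; the values are
# illustrative and are NOT taken from formats.json.
SAMPLE_FORMATS = [
    {"fileExtension": "abc", "name": "Example format A",
     "mimeType": "application/x-example-a", "longTerm": "yes", "comment": ""},
    {"fileExtension": "xyz", "name": "Example format B",
     "mimeType": "application/x-example-b", "longTerm": "restricted",
     "comment": "only without compression"},
]

# Meanings of the longTerm values, as listed in the field description.
LONG_TERM_MEANING = {
    "yes": "long-term format",
    "no": "not suitable, another format should be used",
    "restricted": "can be used for long-term preservation in some cases",
    "unsure": "status remains to be evaluated",
}


def find_by_extension(extension, formats):
    """Return the first entry matching the extension (case-insensitive, leading dot ignored)."""
    wanted = extension.lower().lstrip(".")
    for entry in formats:
        if entry["fileExtension"].lower() == wanted:
            return entry
    return None


def describe_long_term(extension, formats):
    """Summarize the preservation status recorded for an extension."""
    entry = find_by_extension(extension, formats)
    if entry is None:
        return "unknown extension"
    text = LONG_TERM_MEANING[entry["longTerm"]]
    # For restricted formats the comment field carries the details.
    if entry["longTerm"] == "restricted" and entry["comment"]:
        text += " (" + entry["comment"] + ")"
    return text
```

With the real data, the list returned by get_formats() would presumably take the place of SAMPLE_FORMATS, provided its entries have the shape assumed here.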

Development (Python)

Install the needed development packages: pip install -r requirements_dev.txt

Linting, tests and test coverage

  • run the tests: tox
  • check coverage and create a report: coverage run test and coverage html
  • check linting: flake8