acdh-oeaw / arche-metadata-crawler
Script and library for checking and generating ARCHE metadata in ACDH schema
0.8.1
2024-04-09 18:14 UTC
Requires
- php: >=8.1 <8.3
- acdh-oeaw/arche-assets: ^3.9.4
- acdh-oeaw/arche-lib-ingest: ^4
- acdh-oeaw/arche-lib-schema: ^7
- phpoffice/phpspreadsheet: ^1.29
- zozlak/argparse: ^1.0
- zozlak/logging: ^1.0
Requires (Dev)
- phpstan/phpstan: ^1
README
Functionality
A set of scripts used for metadata curation during ARCHE ingestions:
- Merging metadata of a collection from inputs in various formats
- Validating the merged metadata
- Generating XLSX metadata templates based on the current ontology (see the horizontal metadata files in the metadata formats description)
Installation
Locally
- Install PHP 8 and composer
- Run:
composer require acdh-oeaw/arche-metadata-crawler
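The package metadata above declares the PHP constraint >=8.1 <8.3, so "PHP 8" on its own is not precise enough: 8.0 and 8.3 fall outside the range. A minimal sketch of a pre-install check follows; the helper name php_version_ok is hypothetical and not part of this package.

```shell
# Hypothetical helper (not shipped with arche-metadata-crawler):
# succeeds when the given version string satisfies ">=8.1 <8.3".
php_version_ok() {
    local major rest minor
    major=${1%%.*}              # text before the first dot
    rest=${1#*.}                # text after the first dot
    minor=${rest%%.*}           # text up to the next dot
    minor=${minor%%[!0-9]*}     # drop suffixes such as "-dev"
    [ -n "$major" ] && [ -n "$minor" ] || return 1
    [ "$major" = 8 ] && [ "$minor" -ge 1 ] && [ "$minor" -lt 3 ]
}

# Usage (requires php and composer on the PATH):
#   php_version_ok "$(php -r 'echo PHP_VERSION;')" \
#       && composer require acdh-oeaw/arche-metadata-crawler
```

Composer performs the same check itself and refuses to install on an unsupported PHP version; the sketch only makes the failure visible earlier, e.g. in a setup script.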
As a docker image
- Install docker.
- Run the acdhch/repo-file-checker image, mounting your data directory into it:
docker run --rm -ti --entrypoint bash -u `id -u`:`id -g` \
    -v pathToYourDataDir:/data \
    acdhch/repo-file-checker
- Run the scripts, e.g.
/opt/vendor/bin/arche-create-metadata-template /data all
and
/opt/vendor/bin/arche-crawl-meta \
    /data/metadata \
    /data/merged.ttl \
    /ARCHE/staging/GlaserDiaries_16674/data \
    https://id.acdh.oeaw.ac.at/glaserdiaries
- if you need the file-checker,
it is available under
/opt/vendor/bin/arche-filechecker
On ACDH Cluster
Nothing to be done. It is installed there already.
Usage
(For a full walk-through using repo-ingestion@hephaistos and the Wollmilchsau test collection please look here)
On ACDH Cluster
First, get the arche-ingestion workload console by:
- Opening this link (if you are redirected to the login page, open the link once again after you log in)
- Clicking on the bluish button with three vertical dots in the top-right corner of the screen and choosing
> Execute Shell
Then:
- Generate and validate the metadata:
- Open a screen session (the shell disconnects after one minute of inactivity) with
screen
- If you need to reconnect to the screen session because it was disconnected, run
screen -rd
- Run the arche-crawl-meta script:
/ARCHE/vendor/bin/arche-crawl-meta \
    <pathToMetadataDirectory> \
    --filecheckerReportDir <pathToTheFileCheckerReportDirectory> \
    <outputTtlPath> \
    <basePathOfTheCollection> \
    <idPrefix> \
    2>&1 | tee <pathToLogFile>
e.g.
/ARCHE/vendor/bin/arche-crawl-meta \
    /ARCHE/staging/GustavMahlerArchiv_22334/metadata \
    --filecheckerReportDir /ARCHE/staging/GustavMahlerArchiv_22334/checkReports/2024_04_08_09_19_24 \
    /ARCHE/staging/GustavMahlerArchiv_22334/scriptFiles/metadata.ttl \
    /ARCHE/staging/GustavMahlerArchiv_22334/data \
    https://id.acdh.oeaw.ac.at/GustavMahlerArchiv \
    2>&1 | tee /ARCHE/staging/GustavMahlerArchiv_22334/scriptFiles/metadata.log
- If you want to skip the checks (which speeds up the process significantly), add the --noCheck parameter, e.g.
/ARCHE/vendor/bin/arche-crawl-meta \
    /ARCHE/staging/GustavMahlerArchiv_22334/metadata \
    --filecheckerReportDir /ARCHE/staging/GustavMahlerArchiv_22334/checkReports/2024_04_08_09_19_24 \
    /ARCHE/staging/GustavMahlerArchiv_22334/scriptFiles/metadata.ttl \
    /ARCHE/staging/GustavMahlerArchiv_22334/data \
    https://id.acdh.oeaw.ac.at/GustavMahlerArchiv \
    --noCheck \
    2>&1 | tee /ARCHE/staging/GustavMahlerArchiv_22334/scriptFiles/metadata.log
- Create metadata templates:
/ARCHE/vendor/bin/arche-create-metadata-template \
    <pathToDirectoryWhereTemplateShouldBeCreated> \
    all
e.g. to create templates in the current directory
/ARCHE/vendor/bin/arche-create-metadata-template . all
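In the arche-crawl-meta example above, the --filecheckerReportDir value is a timestamp-named directory under checkReports in the collection's staging directory. When several reports exist, picking the newest one by hand is error-prone. The following is only a sketch: the helper name latest_report_dir is hypothetical, and it assumes every report directory follows the zero-padded YYYY_MM_DD_hh_mm_ss naming seen in the example, so that alphabetical order matches chronological order.

```shell
# Hypothetical helper (not part of arche-metadata-crawler):
# prints the alphabetically last subdirectory of <stagingDir>/checkReports.
# Assumes timestamp-named report directories such as 2024_04_08_09_19_24.
latest_report_dir() {
    local dir last
    last=""
    # The shell expands the glob in sorted order, so the last match wins.
    for dir in "$1"/checkReports/*/; do
        [ -d "$dir" ] && last="${dir%/}"
    done
    # Fail (print nothing) when no report directory exists.
    [ -n "$last" ] && printf '%s\n' "$last"
}

# Usage on the cluster, with the staging path from the example above:
#   reports=$(latest_report_dir /ARCHE/staging/GustavMahlerArchiv_22334) \
#       && echo "--filecheckerReportDir $reports"
```

The printed path can then be passed as the --filecheckerReportDir value in the command shown above.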
Locally
- Generating and validating the metadata:
vendor/bin/arche-crawl-meta \
    --filecheckerOutput <pathTo_fileList.json_generatedBy_repo-filechecker> \
    <pathToCollectionData> \
    <pathToTargetMetadataFile>
e.g.
vendor/bin/arche-crawl-meta \
    metaDir \
    metadata.ttl `pwd`/data https://id.acdh.oeaw.ac.at/myCollection
- Creating metadata templates:
vendor/bin/arche-create-metadata-template \
    <pathToDirectoryWhereTemplateShouldBeCreated> \
    all
e.g. to create templates in the current directory
vendor/bin/arche-create-metadata-template . all
Remarks:
- To get a list of all available parameters run:
vendor/bin/arche-crawl-meta --help
vendor/bin/arche-create-metadata-template --help
As a docker container
- Creating metadata templates:
Run a container, mounting the directory where the templates should be created under /mnt inside the container:
docker run \
    --rm -u `id -u`:`id -g` \
    -v <pathToDirectoryWhereTemplateShouldBeCreated>:/mnt \
    acdhch/repo-file-checker createTemplate all
e.g. to create the templates in the current directory
docker run \
    --rm -u `id -u`:`id -g` -v `pwd`:/mnt acdhch/repo-file-checker createTemplate all