Data Transformation Pipeline Code

Architecture

[architecture diagram]

Pipeline Components

Future ITS-Owned Components:

  • Harvester: ActivityStreams-based record harvester that stores records to a Cache.
  • Cache: Record cache for storing local copies of JSON data. Currently PostgreSQL or the filesystem.
  • IdMap: Identifier map that mints, manages, and retrieves external/internal identifier sets. Currently Redis or in-memory (a minimal sketch follows this list).
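
Below is a minimal in-memory sketch of what the IdMap does. The names here (InMemoryIdMap, mint, get, merge) are illustrative assumptions, not the pipeline's actual API; per the list above, the real store is Redis or in-memory:

```python
import uuid

class InMemoryIdMap:
    """Toy IdMap: maps external URIs to a single minted internal
    identifier, so that equivalent records share one internal URI."""

    def __init__(self, internal_prefix="https://lux.example.org/data/"):
        # internal_prefix is a hypothetical URI base, not LUX's real one
        self.internal_prefix = internal_prefix
        self._external_to_internal = {}

    def mint(self, external_uri):
        # Mint a fresh internal identifier for an unseen external URI
        internal = self.internal_prefix + str(uuid.uuid4())
        self._external_to_internal[external_uri] = internal
        return internal

    def get(self, external_uri):
        # Retrieve the internal identifier, minting one if needed
        if external_uri not in self._external_to_internal:
            return self.mint(external_uri)
        return self._external_to_internal[external_uri]

    def merge(self, uri_a, uri_b):
        # Record that two external URIs describe the same entity,
        # collapsing them onto one internal identifier
        internal = self.get(uri_a)
        self._external_to_internal[uri_b] = internal
        return internal
```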

Pipeline Components:

  • config: Configuration of the pipeline, stored as JSON records in a cache.
  • process/Collector: Recursively collects the identifiers referenced by a given record.
  • process/Reconciler: Reconciles records against external sources to find matching entities.
  • process/Merger: Merges two Linked Art records that represent the same entity.
  • process/Reidentifier: Recursively rewrites external URIs in a record to internal identifiers, given an IdMap.
  • process/ReferenceManager: Manages inter-record connections.
  • process/UpdateManager: Manages harvesting and updates to caches.
  • sources/*/Acquirer: Wraps a Fetcher and a Mapper to acquire a record from either the network or a cache (see the composition sketch after this list).
  • sources/*/Fetcher: Fetches an identified record from an external source into a Cache.
  • sources/*/Harvester: Retrieves multiple records via IIIF Change Discovery or OAI-PMH.
  • sources/*/Mapper: Maps from a source format into Linked Art, or from Linked Art into a discovery-layer format (MarkLogic, QLever).
  • sources/*/Reconciler: Determines whether the entity in a given record is described in the external source.
  • sources/*/Loader: Loads a dump of the data into the data cache.
  • sources/*/IndexLoader: Creates an inverted index for reconciling records against this dataset.
  • storage/MarkLogic: MarkLogic storage of data.
  • storage/Cache: Data caches (PostgreSQL, file system).
  • storage/Idmap: Key/value store API (Redis, file system, in-memory).
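
To make the flow concrete, here is a minimal, hypothetical composition of the per-record stages in Python. process_record and the toy stand-ins below are assumptions for illustration only, not the pipeline's actual classes or API:

```python
def process_record(uri, acquire, reconcile, merge, reidentify):
    """Illustrative per-record flow: acquire a record, reconcile it
    against external sources, merge any matches, then rewrite URIs."""
    record = acquire(uri)              # sources/*/Acquirer: cache or network
    for match in reconcile(record):    # sources/*/Reconciler: same-entity candidates
        record = merge(record, match)  # process/Merger: fold matches together
    return reidentify(record)          # process/Reidentifier: external -> internal URIs

# Toy stand-ins so the sketch runs end to end:
cache = {"http://vocab.getty.edu/aat/300033618":
         {"id": "http://vocab.getty.edu/aat/300033618", "_label": "paintings"}}
acquire = cache.__getitem__
reconcile = lambda record: []                   # pretend no matches were found
merge = lambda record, match: {**record, **match}
reidentify = lambda record: {**record, "id": "https://lux.example.org/data/1"}

print(process_record("http://vocab.getty.edu/aat/300033618",
                     acquire, reconcile, merge, reidentify))
```

In the real pipeline these roles are played by the Acquirer, Reconciler, Merger, and Reidentifier components listed above.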

External Sources: Implementation Status

Source            Fetch  Map  Name  Reconcile  Load  IdxLoad
AAT               N/A -
DNB               - -
FAST              - - - -
Geonames          - -
LCNAF             -
LCSH
TGN               - N/A -
ULAN              N/A -
VIAF              - -
Who's on First    - N/A -
Wikidata          -
Japan NL          - N/A -
BNF               - N/A -
GBIF              - N/A -
ORCID             - N/A -
ROR               - N/A -
Wikimedia API     - N/A -
DNB               - N/A -
BNE               - N/A -
Nomenclature      - - - - -
Getty Museum      - - - - -
Homosaurus        - - - - -
Nomisma           - - - - -
SNAC              - - - - -

✅ = Done; - = Not started; N/A = Can't/Won't be done

  • AAT, TGN, ULAN: Dump files are N-Triples based; reconstructing full records from them would be more effort than it is worth. Instead we can use IIIF Change Discovery to synchronize.
  • WOF: The dump file is a 33 GB SQLite database, so we just use it as the cache directly (see the sketch below).
  • FAST: Not implemented yet (needs to process MARC/XML).
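
For WOF, a minimal sketch of reading the SQLite dump directly as a read-only cache. The geojson table name and its (id, body, is_alt) columns are assumptions based on the standard WOF distribution; verify them against your download:

```python
import json
import sqlite3

def wof_get(db_path, wof_id):
    """Read one record straight out of the WOF SQLite dump,
    treating it as a read-only record cache."""
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        row = con.execute(
            "SELECT body FROM geojson WHERE id = ? AND is_alt = 0",
            (wof_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None
    finally:
        con.close()

# e.g. wof_get("whosonfirst-data-admin-latest.db", 85633793)  # a WOF place id
```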

Fetching external source dump files

Process:

  1. In the source's config file, look up dumpFilePath and remoteDumpFile.
  2. Go to the directory containing dumpFilePath and rename the existing dump file with a date suffix (e.g. latest-2022-07).
  3. Execute wget <url>, where <url> is the URL from remoteDumpFile (and preferably validate the URL by hand online first).
  4. For Wikidata, as the dump is SO HUGE, instead run nohup wget --quiet <url> & to fetch it in the background so we can get on with our lives in the meantime.
  5. Done :) (a scripted sketch of steps 1-4 follows).
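
Here is a sketch of steps 1-4 as a Python script. It assumes the config is a JSON file containing the dumpFilePath and remoteDumpFile keys named in step 1; refresh_dump and the other details are illustrative:

```python
import json
import shutil
import subprocess
from datetime import date
from pathlib import Path

def refresh_dump(config_path):
    """Rename the current dump with a date suffix, then fetch
    the new one with wget (steps 1-4 above)."""
    cfg = json.loads(Path(config_path).read_text())
    dump = Path(cfg["dumpFilePath"])
    url = cfg["remoteDumpFile"]

    if dump.exists():  # step 2: keep the previous dump around, dated
        dated = dump.with_name(f"{dump.stem}-{date.today():%Y-%m}{dump.suffix}")
        shutil.move(str(dump), str(dated))

    # steps 3-4: fetch quietly; wrap in nohup yourself for huge dumps like Wikidata
    subprocess.run(["wget", "--quiet", "-O", str(dump), url], check=True)
```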

About

Data pipeline to harvest, transform, reconcile, enrich and export Linked Art data for LUX (or other system)
