This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

eark-project / dm-etl Public archive

Tool of the full-scale E-ARK deployment to extract content from AIPs and load content into Lily.

Apache-2.0 license

Repository contents (21 commits)

archive_search
cluster_scripts
src
.gitignore
LICENSE
README.md
archetype_info.txt
commands.txt
mrjob-assembly.xml
pom.xml

dm-etl

E-ARK WP6 reference implementation: ETL of AIPs into Lily

TODOs

Extract content from AIPs
Extract content from containers inside an AIP (e.g. WARC)
Load content into Lily at the "content level"
Create Solr and other configurations and commit them to src/main/config/
Provide a CLI to load data into Lily
Experiment with full-text indexing on high volumes (WARC data)
Possibly develop a MapReduce (MR) job to run this on all AIPs in HDFS
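
The first TODO item, extracting content from an AIP, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: it assumes the AIP is packaged as a ZIP file (a TAR package would need a TAR reader instead), and the class name `AipExtractSketch`, its methods, and the sample paths are hypothetical. It uses only `java.util.zip` from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of dm-etl.
class AipExtractSketch {

    // Reads every regular file of a ZIP-packaged AIP into memory, keyed by entry path.
    // Assumption: the AIP is a ZIP file small enough to hold in memory.
    static Map<String, byte[]> extract(InputStream aipStream) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(aipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directory entries carry no content
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                files.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    // Builds a tiny in-memory package with made-up paths so the sketch runs without real data.
    static byte[] samplePackage() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("METS.xml"));
            zip.write("<mets/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("data/"));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("data/page.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Extract the sample package and list each file with its size.
        Map<String, byte[]> files = extract(new ByteArrayInputStream(samplePackage()));
        files.forEach((path, content) ->
                System.out.println(path + " (" + content.length + " bytes)"));
    }
}
```

Container formats inside the AIP, such as WARC, would need a second pass over the extracted bytes with a format-specific reader. For high volumes, streaming each entry to its destination would be preferable to buffering everything in memory.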

Future

Create Pig query files in src/main/resources
Understand metadata (e.g. METS)
Understand content (e.g. text inside MS Office documents)
Experiment with indexing large CSV exports from databases in Solr
Integrate denormalized databases

Non-goals

Querying is out of scope; it is handled by another project.
