This repository has been archived by the owner on Dec 18, 2019. It is now read-only.
Tool of the full-scale E-ARK deployment to extract content from AIPs and load content into Lily.

eark-project/dm-etl

dm-etl

E-ARK WP6 reference implementation: ETL (extract, transform, load) of AIPs into Lily

TODOs

  • extract content from AIPs
  • extract content from containers inside an AIP (e.g. WARC)
  • load content into Lily at "content level"
  • create Solr and other configurations and commit them to src/main/config/
  • provide a CLI to load data into Lily
  • experiment with full-text indexing on high volumes (WARC data)
  • possibly develop a MapReduce (MR) job to run this on all AIPs in HDFS
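
The first TODO item, extracting content from an AIP, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: it assumes the AIP is packaged as a TAR file, which this README does not state, and the names `ContentRecord`, `iter_aip_content` and `build_demo_aip`, as well as the demo directory layout, are hypothetical. Python is used because its standard library can read TAR archives. Loading into Lily is deliberately left out.

```python
import io
import tarfile
from typing import Iterator, NamedTuple


class ContentRecord(NamedTuple):
    """One payload file taken from an AIP (hypothetical record shape)."""
    path: str    # member path inside the archive
    size: int    # size in bytes
    data: bytes  # raw file content


def iter_aip_content(aip_path: str) -> Iterator[ContentRecord]:
    """Yield one record per regular file in a TAR-packaged AIP (assumed format)."""
    with tarfile.open(aip_path, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and special files
            handle = archive.extractfile(member)
            if handle is None:
                continue
            with handle:
                yield ContentRecord(member.name, member.size, handle.read())


def build_demo_aip(aip_path: str) -> None:
    """Write a tiny TAR file standing in for an AIP (demo data only)."""
    files = {
        "demo-aip/METS.xml": b"<mets/>",
        "demo-aip/representations/rep1/data/hello.txt": b"hello",
    }
    with tarfile.open(aip_path, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
```

A loader could consume these records one at a time and map each to a record in Lily. Nested containers such as WARC files would need a further unpacking step that is not shown here.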

Future

  • create Pig query files in src/main/resources
  • understand metadata (e.g. METS)
  • understand content (e.g. text inside MS Office documents)
  • experiment with large CSV exports from databases in Solr
  • integrate denormalized databases

Non-Goals

  • no query support; querying is handled by a separate project
