Harvard Library Bibliographic Dataset parser

Harvard University publicly releases the 12+ million metadata records from its library catalogs, covering photos, journals, books, recordings and manuscripts. The release is known as the Harvard Library Bibliographic Dataset, and Harvard publishes documentation for it.

The dataset is in the public domain, but the data is in the arcane and wonky "MARC21" format (see the Library of Congress' official documentation on MARC21).

This is a parser for the records in that dump. It can output either JSON or SQL. It builds on Nathan Denny's MARC21 library and adds a ton of stuff on top, including friendly names for fields and lots of heuristic tricks to determine the content type of each item. Knowing the content type opens up even more metadata encoded in the infamous "Record 008" (MARC field 008, the fixed-length data elements, whose layout depends on the kind of material).
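To illustrate why the content type matters for field 008, here is a minimal sketch. It is not the code in marc.py; the function names `material_category` and `parse_008_common` are hypothetical, and the sketch assumes the leader and the 008 field are already available as plain strings. It relies only on the published MARC21 layout: Leader positions 06 and 07 select the material category, positions 00-17 and 35-39 of field 008 mean the same thing for every category, and positions 18-34 are category specific.

```python
# Hypothetical helpers, not part of marc.py: a sketch of how the MARC21
# leader decides which layout applies to field 008.

def material_category(leader):
    """Map Leader/06 (type of record) and Leader/07 (bibliographic level)
    to the 008 material category defined by MARC21."""
    rec_type = leader[6]
    bib_level = leader[7]
    if rec_type in "at":
        # Language material is either a book or a continuing resource,
        # depending on the bibliographic level.
        if rec_type == "a" and bib_level in "bis":
            return "continuing_resources"
        return "books"
    if rec_type in "cdij":
        return "music"
    if rec_type in "ef":
        return "maps"
    if rec_type in "gkor":
        return "visual_materials"
    if rec_type == "m":
        return "computer_files"
    if rec_type == "p":
        return "mixed_materials"
    return "unknown"


def parse_008_common(field_008):
    """Split out the 008 positions shared by all material categories."""
    # Assumption: short fields are padded rather than rejected, since
    # real-world catalog data is not always exactly 40 characters long.
    f = field_008.ljust(40)
    return {
        "date_entered": f[0:6],            # yymmdd
        "date_type": f[6],
        "date1": f[7:11].strip() or None,
        "date2": f[11:15].strip() or None,
        "place": f[15:18].strip() or None,
        "material_specific": f[18:35],     # meaning depends on category
        "language": f[35:38].strip() or None,
    }


if __name__ == "__main__":
    leader = "00000nam a2200000 a 4500"
    field = "850101" + "s" + "1984" + "    " + "mau" + " " * 17 + "eng" + " " + "d"
    print(material_category(leader))
    print(parse_008_common(field))
```

The actual heuristics in this parser may differ; the point is only that positions 18-34 cannot be decoded until the material category is known.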

Usage

   python marc.py sql|json [input] > your_output_file.json

The first argument selects the output format, sql or json. Download and uncompress the Harvard dataset into the same directory as the script (the script expects the data to be in data/hlom/). Alternatively, you can specify an individual source file as the input argument.

For example:

   python marc.py json > output.json

Also included are samples of the JSON and SQL output.

License

Released under the "MIT" or "BSD" license scheme. See LICENSE file.

TODO

The SQL schema is terrible!
More metadata parsing
Better detection of music and audio
Translate more fields
alt_glyph support
