get db metadata to check for completeness #313

Open
lidiabressan opened this issue Jul 3, 2023 · 6 comments
@lidiabressan

Hi,

I would like to check whether my db is complete, or how much of it is complete and which fields are missing. The db contains GRIB data covering a couple of years, for selected model levels and variables.

As far as I understand, my solutions are:

  1. dump all metadata with arki-query --dump and then process the text (multiple lines for each datum, so a very long text for a db covering a couple of years),
  2. loop over products, origins, reftimes and timeranges with arki-query --dump --summary --summary-restrict levels and check each result (shorter text, but the same procedure repeated many times).

Is there a way to get an output of only the unique parameters of metadata on a row?

Do you have any advice?

Thanks

Lidia

@spanezz
Contributor

spanezz commented Jul 4, 2023

I understand that you need to check whether an arkimet dataset has data for a whole set of products and levels for each day (or every 6 hours, or whatever the actual model output interval is). Is that understanding correct?

@lidiabressan
Author

lidiabressan commented Jul 4, 2023

Yes: hourly analysis and forecast, different variables, some at the surface and some on different levels (and we go back to 2018, still the reference year for aq).
I know in advance which variables come with which levels, and the origin discriminates between analysis and forecast.

@spanezz
Contributor

spanezz commented Jul 4, 2023

I cannot think of any existing functionality that does something that specific out of the box, but it should be reasonably doable with a bit of Python.

This is an example script that queries a dataset at regular intervals and checks that there is some data for all intervals. It could be a good base from which to build to check your required combinations of metadata:

#!/usr/bin/python3
import argparse
import datetime
from collections import defaultdict

import arkimet as arki


class Instant:
    def __init__(self):
        self.levels = set()
        self.products = set()


class Checker:
    def __init__(self):
        self.instants = defaultdict(Instant)

    def on_metadata(self, md):
        reftime = md.to_python("reftime")["time"]
        instant = self.instants[reftime]
        try:
            instant.levels.add(md["level"])
        except KeyError:
            pass
        try:
            instant.products.add(md["product"])
        except KeyError:
            pass

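    def report(self):
        # Nothing matched the query: min() and max() would fail on an empty dict
        if not self.instants:
            print("no data found")
            return

        # Walk every hour between the first and the last reftime seen
        begin = min(self.instants)
        until = max(self.instants)
        step = datetime.timedelta(hours=1)

        cur = begin
        while cur <= until:
            # Use get() so that the lookup does not create an empty Instant
            instant = self.instants.get(cur)
            if instant is None:
                print("data missing for reftime", cur)
            else:
                if not instant.levels:
                    print("levels missing at reftime", cur)
                if not instant.products:
                    print("products missing at reftime", cur)
            cur += step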


def main():
    parser = argparse.ArgumentParser(description="check a dataset for completeness")
    parser.add_argument("dataset", action="store", help="Path to the dataset")
    args = parser.parse_args()

    checker = Checker()

    with arki.dataset.Session() as session:
        cfg = arki.dataset.read_config(args.dataset)
        with session.dataset_reader(cfg=cfg) as ds:
            ds.query_data("reftime:every 1 hour", on_metadata=checker.on_metadata)

    checker.report()


if __name__ == "__main__":
    main()
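
To go from "some data exists" to the required combinations, the (product, level) pairs seen at each reftime can be compared with a list known in advance. The following is a minimal sketch of that comparison in plain Python, with no arkimet calls. `find_missing` and its parameters are hypothetical names for illustration, and it assumes the pairs are collected in on_metadata as the strings returned by md["product"] and md["level"].

```python
import datetime


def find_missing(observed, expected, begin, until, step=datetime.timedelta(hours=1)):
    """Return a list of (reftime, product, level) that were expected but not seen.

    observed: dict mapping reftime -> set of (product, level) string pairs seen
    expected: set of (product, level) string pairs required at every step
    begin, until: first and last reftime to check, both inclusive
    """
    missing = []
    cur = begin
    while cur <= until:
        # A reftime with no data at all counts as missing every expected pair
        seen = observed.get(cur, set())
        for product, level in sorted(expected - seen):
            missing.append((cur, product, level))
        cur += step
    return missing
```

In the Checker class, this would mean storing pairs in a dict of sets keyed by reftime instead of two separate sets, then printing the result of `find_missing` in report(). Analysis and forecast could be checked separately by keeping one such dict per origin.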

@lidiabressan
Author

Thanks!

I tried to adapt it to my case, but I struggled with the documentation about the metadata, and I have a couple of questions.

From various prints (print(dir(md))), I discovered:

  • md.to_python gives a dictionary. What are the dictionary keys to get the values? I could not find them.
  • md.to_string gives a string, which is however different from the string used for queries: query strings, also in Python, are as on the arkimet command line, "GRIB1,x,x,x", but the Python API returns different strings, "GRIB1(00x, 00x, 00x)". Could these be used for queries too?
  • Are md["level"] and md.to_string("level") the same?
  • Can I get values too, or should I extract them from the dictionary?
    Are these in the documentation? Where could I find them?

About the dataset:

With arki-query I can also get information about a GRIB file. Can I use this script with dataset = grib: file.grib, as on the command line? I tried but could not make it work.

thanks

@spanezz
Contributor

spanezz commented Aug 10, 2023

As a general pointer, which doesn't answer your questions at the moment, the existing documentation for the Metadata class in Python can be found here: https://arpa-simc.github.io/arkimet/python/arkimet.html#arkimet.Metadata

The dictionary keys for to_python are different for each metadata type (origin, level, ...) and for each style of metadata type (grib product, bufr product, ...). There is no detailed documentation of the representation as a dictionary, and print() is currently the best way to explore their layout. Some general documentation of metadata types and styles can be found at https://arpa-simc.github.io/arkimet/metadata.html
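
Since print() is the suggested way to explore the layouts, a small helper can summarise which keys show up for each metadata type and style. This is a sketch only: `collect_layouts` is a hypothetical name, and it assumes that each dictionary from to_python carries "type" and "style" entries, which should be checked against real output.

```python
from collections import defaultdict


def collect_layouts(dicts):
    """Map (type, style) -> sorted list of all keys seen in the given dicts.

    dicts: iterable of dictionaries such as those returned by
    md.to_python("level"). The "type" and "style" keys are an assumption;
    dictionaries without them are grouped under (None, None).
    """
    layouts = defaultdict(set)
    for d in dicts:
        key = (d.get("type"), d.get("style"))
        # Updating a set with a dict adds the dict's keys
        layouts[key].update(d)
    return {key: sorted(names) for key, names in layouts.items()}
```

Inside on_metadata, one could append md.to_python("level") (and the same for the other types of interest) to a list, then pass that list to `collect_layouts` once the query is done and print the result.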

md.to_string gives the string one sees in arki-query --yaml, which is indeed different from what one could use for queries, although the two have many things in common, since the queries need to match the data that one sees in arki-query --yaml. The syntax of queries is documented here: https://arpa-simc.github.io/arkimet/matcher.html

md["level"] and md.to_string("level") are the same, yes

If you need values for levels, you can extract them from the dictionary returned by to_python, depending on what you need them for. Level information as stored by arkimet, as with any other metadata type, is the bare minimum that arkimet needs to distinguish data in the datasets. It might not be a comprehensive description, although it tends to contain useful information.

You should be able to open a grib file as if it were a dataset by passing its path to read_config. For example:

import arkimet

with arkimet.dataset.Session() as session:
    cfg = arkimet.dataset.read_config("file.grib")
    ...

@lidiabressan
Author

One more question:

With the arki-query command line, I can query several datasets by listing them (arki-query '' dataset1 dataset2).

Can I put a list of datasets in args.dataset?

with arki.dataset.Session() as session:
    cfg = arki.dataset.read_config(args.dataset)
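
One possible approach, not confirmed by the maintainers in this thread, is to accept more than one path on the command line and run the same query on each dataset in turn. The sketch below only reuses arkimet calls that already appear in the script above (read_config, dataset_reader, query_data). `build_parser` and `query_all` are hypothetical names, and the sketch does not assume that read_config accepts a list.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="check datasets for completeness")
    # nargs="+" collects one or more paths into a list
    parser.add_argument("datasets", nargs="+", help="Paths to the datasets")
    return parser


def query_all(paths, on_metadata, query="reftime:every 1 hour"):
    """Run the same query on each dataset, feeding one shared callback."""
    # Imported here so that the argument parsing can be tried without arkimet
    import arkimet as arki

    with arki.dataset.Session() as session:
        for path in paths:
            # One reader per dataset, opened and closed in turn
            cfg = arki.dataset.read_config(path)
            with session.dataset_reader(cfg=cfg) as ds:
                ds.query_data(query, on_metadata=on_metadata)
```

In main(), `args = build_parser().parse_args()` followed by `query_all(args.datasets, checker.on_metadata)` would take the place of the single-dataset block, so that one Checker accumulates metadata from all the listed datasets.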
