ENH: DAG facilitating nested DataCatalog structure #648

felixschmitz · 2024-10-31T13:55:52Z

Is your feature request related to a problem?

When using a nested DataCatalog of the kind

from pytask import DataCatalog


MODEL_NAMES = ("ols", "logistic_regression")
DATA_NAMES = ("data_1", "data_2")


nested_data_catalogs = {
    model_name: {
        data_name: DataCatalog(name=f"{model_name}-{data_name}")
        for data_name in DATA_NAMES
    }
    for model_name in MODEL_NAMES
}

and adding products to a DataCatalog e.g. via the following task:

from pathlib import Path
from pytask import task
from typing_extensions import Annotated

from my_project.config import DATA_NAMES
from my_project.config import MODEL_NAMES
from my_project.config import nested_data_catalogs


for model_name in MODEL_NAMES:
    for data_name in DATA_NAMES:

        @task
        def fit_model(
            path: Path = Path("...", data_name)
        ) -> Annotated[
            Any, nested_data_catalogs[model_name][data_name]["fitted_model"]
        ]:
            data = ...
            fitted_model = ...
            return fitted_model

as described in the extended DataCatalog guide, I would expect the DAG to facilitate the nested structure of the DataCatalog.

For now the PickleNode's name, "fitted_model" in the example, is only used in the representation of the DAG. When having multiple models and datasets, the information "fitted_model" is on the one hand insufficient, and on the other hand, produces a DAG which implies the wrong structure and dependencies.

Describe the solution you'd like

I would want the DAG to facilitate the nested structure of the DataCatalog and not only use the PickleNode's name. One approach would be to display in the DAG the name of the DataCatalog and the PickleNode, e.g. ols1-data_1-fitted_model. Another approach would be to use the key values of nested_data_catalogs and join these with the PickleNode's name, producing a similar result in the example above, but guaranteeing a more informative name in general.

The text was updated successfully, but these errors were encountered:

tobiasraabe · 2024-11-02T12:44:00Z

Hi @felixschmitz, what you are discussing is only a cosmetic issue when creating an image of the dag with pytask dag, right?

Using the data catalog name to resolve duplicates sounds like a reasonable idea. pytask does something related to nodes with path names by removing as many prefixes as possible while keeping the name unique.

Using the nested dictionary and reading the prefixes from the keys is only possible if there is a create_data_catalogs_from_dictionary() function or something similar that consumes the whole dictionary. But then we are pretty much at something that everyone can write for themselves if they need it. I do not see much benefit in supporting this as a special case.

Do you want to fix it? I could give you some guidance.

felixschmitz · 2024-11-09T15:48:50Z

Thanks & apologies for my late response.
I do agree that the issue is only on the cosmetic level, when creating the image of the dag with pytask dag.

Yet, I disagree with your elaboration on what pytask is doing "related to nodes with path names by removing as many prefixed as possible while keeping the name unique". The result I am observing, is that in the image of the dag, the element of a datacatalog gets shortened to "fitted_model", which is non-unique across the datacatalogs. Hence, all tasks point towards "fitted_model" and suggest there is a single (shared) product. My point is that the name is non-unique, when it should be. Do you agree on that?

tobiasraabe · 2024-11-09T17:42:42Z

I don't think we have a disagreement here 😄. Maybe my comment was confusing and going too much on a tangent. The comment was about related functionality to keep displayed names short but not about the specific issue. Long names easily pollute the whole command line interface.

When it comes to the data catalog node names in the DAG, pytask just displays the names of the nodes which are the keys to the data catalog without any shortening.

In the linked PR I have a fix ready, I just have to think about whether I like it.

BTW: Please, consider next time to post the command you ran plus the image of the DAG. It would have helped to understand the issue even better.

felixschmitz · 2024-11-10T10:44:11Z

Alright, that helped! Thanks a lot. I'd be happy if the feature (or a similar solution) made it into the package.
Will also keep your advice on issue specification in mind.

hmgaudecker · 2024-11-12T17:21:09Z

I second that the DAG representation can be very misleading, so some fix would be great!

While you are at the data catalog, would it be possible to bring entries back? Maybe as a method to make clear it is just for reading, not setting? I know that we have _entries, but it always feels bad to tell ruff to exclude things 😇

felixschmitz added the enhancement New feature or request label Oct 31, 2024

tobiasraabe linked a pull request Nov 9, 2024 that will close this issue

Add data catalog name to nodes to make representation unique. #650

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: DAG facilitating nested DataCatalog structure #648

ENH: DAG facilitating nested DataCatalog structure #648

felixschmitz commented Oct 31, 2024

tobiasraabe commented Nov 2, 2024 •

edited

Loading

felixschmitz commented Nov 9, 2024

tobiasraabe commented Nov 9, 2024

felixschmitz commented Nov 10, 2024

hmgaudecker commented Nov 12, 2024

ENH: DAG facilitating nested DataCatalog structure #648

ENH: DAG facilitating nested DataCatalog structure #648

Comments

felixschmitz commented Oct 31, 2024

Is your feature request related to a problem?

Describe the solution you'd like

tobiasraabe commented Nov 2, 2024 • edited Loading

felixschmitz commented Nov 9, 2024

tobiasraabe commented Nov 9, 2024

felixschmitz commented Nov 10, 2024

hmgaudecker commented Nov 12, 2024

tobiasraabe commented Nov 2, 2024 •

edited

Loading