Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: DAG facilitating nested DataCatalog structure #648

Open
felixschmitz opened this issue Oct 31, 2024 · 5 comments · May be fixed by #650
Open

ENH: DAG facilitating nested DataCatalog structure #648

felixschmitz opened this issue Oct 31, 2024 · 5 comments · May be fixed by #650
Labels
enhancement New feature or request

Comments

@felixschmitz
Copy link

Is your feature request related to a problem?

When using a nested DataCatalog of the kind

from pytask import DataCatalog


MODEL_NAMES = ("ols", "logistic_regression")
DATA_NAMES = ("data_1", "data_2")


nested_data_catalogs = {
    model_name: {
        data_name: DataCatalog(name=f"{model_name}-{data_name}")
        for data_name in DATA_NAMES
    }
    for model_name in MODEL_NAMES
}

and adding products to a DataCatalog e.g. via the following task:

from pathlib import Path
from pytask import task
from typing_extensions import Annotated

from my_project.config import DATA_NAMES
from my_project.config import MODEL_NAMES
from my_project.config import nested_data_catalogs


for model_name in MODEL_NAMES:
    for data_name in DATA_NAMES:

        @task
        def fit_model(
            path: Path = Path("...", data_name)
        ) -> Annotated[
            Any, nested_data_catalogs[model_name][data_name]["fitted_model"]
        ]:
            data = ...
            fitted_model = ...
            return fitted_model

as described in the extended DataCatalog guide, I would expect the DAG to facilitate the nested structure of the DataCatalog.

For now the PickleNode's name, "fitted_model" in the example, is only used in the representation of the DAG. When having multiple models and datasets, the information "fitted_model" is on the one hand insufficient, and on the other hand, produces a DAG which implies the wrong structure and dependencies.

Describe the solution you'd like

I would want the DAG to facilitate the nested structure of the DataCatalog and not only use the PickleNode's name. One approach would be to display in the DAG the name of the DataCatalog and the PickleNode, e.g. ols1-data_1-fitted_model. Another approach would be to use the key values of nested_data_catalogs and join these with the PickleNode's name, producing a similar result in the example above, but guaranteeing a more informative name in general.

@felixschmitz felixschmitz added the enhancement New feature or request label Oct 31, 2024
@tobiasraabe
Copy link
Member

tobiasraabe commented Nov 2, 2024

Hi @felixschmitz, what you are discussing is only a cosmetic issue when creating an image of the dag with pytask dag, right?

Using the data catalog name to resolve duplicates sounds like a reasonable idea. pytask does something related to nodes with path names by removing as many prefixes as possible while keeping the name unique.

Using the nested dictionary and reading the prefixes from the keys is only possible if there is a create_data_catalogs_from_dictionary() function or something similar that consumes the whole dictionary. But then we are pretty much at something that everyone can write for themselves if they need it. I do not see much benefit in supporting this as a special case.

Do you want to fix it? I could give you some guidance.

@felixschmitz
Copy link
Author

Thanks & apologies for my late response.
I do agree that the issue is only on the cosmetic level, when creating the image of the dag with pytask dag.

Yet, I disagree with your elaboration on what pytask is doing "related to nodes with path names by removing as many prefixed as possible while keeping the name unique". The result I am observing, is that in the image of the dag, the element of a datacatalog gets shortened to "fitted_model", which is non-unique across the datacatalogs. Hence, all tasks point towards "fitted_model" and suggest there is a single (shared) product. My point is that the name is non-unique, when it should be. Do you agree on that?

@tobiasraabe
Copy link
Member

I don't think we have a disagreement here 😄. Maybe my comment was confusing and going too much on a tangent. The comment was about related functionality to keep displayed names short but not about the specific issue. Long names easily pollute the whole command line interface.

When it comes to the data catalog node names in the DAG, pytask just displays the names of the nodes which are the keys to the data catalog without any shortening.

In the linked PR I have a fix ready, I just have to think about whether I like it.

BTW: Please, consider next time to post the command you ran plus the image of the DAG. It would have helped to understand the issue even better.

@felixschmitz
Copy link
Author

Alright, that helped! Thanks a lot. I'd be happy if the feature (or a similar solution) made it into the package.
Will also keep your advice on issue specification in mind.

@hmgaudecker
Copy link
Contributor

I second that the DAG representation can be very misleading, so some fix would be great!

While you are at the data catalog, would it be possible to bring entries back? Maybe as a method to make clear it is just for reading, not setting? I know that we have _entries, but it always feels bad to tell ruff to exclude things 😇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants