Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: cloud paths checking against patterns #1094

Merged
merged 6 commits into from
Nov 11, 2024

Conversation

akhanf
Copy link
Contributor

@akhanf akhanf commented Oct 14, 2024

The _check_path_matches_patterns function was failing for cloud (e.g. s3, gc3) URIs (added in #1074):

  1. the path.relative_to() function doesn't work for cloud URIs
    (it just returns the original).
  2. combining Path('/') with the result of relative_to() now (ie since universal_pathlib > 0.2.3) throws
    an error when using cloud paths.

So if using universal_pathlib >0.2.3, any cloud URIs as bids datasets would throw an error message.

This fix drops the uri prefix by using .path, so that the regex check works as expected.

The _check_path_matches_patterns function was failing for cloud (e.g.
s3, gc3) URIs.

 1) the path.relative_to() function doesn't work for cloud URIs
(it just returns the original).
 2) combining Path('/') with the result of relative_to now also throws
an error when using cloud (or non-posix) paths.

This fix drops the uri prefix by using .path to get the posix-like
part of the path, so that the regex check works as expected.
@akhanf akhanf changed the title fix for cloud paths FIX: cloud paths checking against patterns Oct 14, 2024
Copy link

codecov bot commented Oct 14, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 89.78%. Comparing base (eb2f926) to head (962cfe7).
Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
bids/layout/tests/test_remote_bids.py 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1094      +/-   ##
==========================================
- Coverage   89.79%   89.78%   -0.01%     
==========================================
  Files          63       64       +1     
  Lines        7141     7154      +13     
  Branches     1367      835     -532     
==========================================
+ Hits         6412     6423      +11     
- Misses        531      534       +3     
+ Partials      198      197       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

was just for local use
@effigies
Copy link
Collaborator

Could you add tests with examples? That will make it easier to be explicit about what was going wrong and allow us to consider other solutions quickly.

The check if in a derivatives folder was not working when the root had a
uri prefix. This fixes it so the uri prefix is stripped off for the
check.
Checks the number of files in an example openneuro s3 dataset. Also adds
s3fs as a test dependency.
@akhanf
Copy link
Contributor Author

akhanf commented Oct 26, 2024

Yes good idea, I've added one simple test using an s3 uri from openneuro, and also made another minor fix.

The test passes locally, but I see now it is failing in the CI as botocore can't find the credentials file.. I'll see if there is a good workaround for this later..

For reference: running the test on the code before this PR gives this error:


test_remote_bids.py F                                                                                                        [100%]

============================================================= FAILURES =============================================================
____________________________ test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136] ____________________________

dataset = 's3://openneuro.org/ds000102', nb_files = 136

    @pytest.mark.parametrize(
        "dataset, nb_files",
        [
            ("s3://openneuro.org/ds000102", 136),
        ],
    )
    def test_layout_on_s3_datasets_no_derivatives(dataset, nb_files):
>       layout = BIDSLayout(dataset)

test_remote_bids.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../layout.py:177: in __init__
    _indexer(self)
../index.py:148: in __call__
    all_bfs, all_tag_dicts = self._index_dir(self._layout._root, self._config)
../index.py:239: in _index_dir
    if force or self._validate_file(abs_fn):
../index.py:162: in _validate_file
    matched_patt = _validate_path(
../index.py:64: in _validate_path
    if _check_path_matches_patterns(path, excl_patt, root=root):
../index.py:49: in _check_path_matches_patterns
    path = Path("/") / path.relative_to(root)
/usr/lib/python3.12/pathlib.py:721: in __truediv__
    return self.joinpath(key)
/usr/lib/python3.12/pathlib.py:717: in joinpath
    return self.with_segments(self, *pathsegments)
../../../.venv/lib/python3.12/site-packages/upath/core.py:560: in with_segments
    return type(self)(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <[AttributeError('drive') raised in repr()] PosixUPath object at 0x7fca4b13de40>, protocol = ''
args = (PosixUPath('/'), S3Path('s3://openneuro.org/ds000102/.gitattributes')), storage_options = {}, base_options = {}
args0 = PosixUPath('/')

    def __init__(
        self, *args, protocol: str | None = None, **storage_options: Any
    ) -> None:
        # allow subclasses to customize __init__ arg parsing
        base_options = getattr(self, "_storage_options", {})
        args, protocol, storage_options = type(self)._transform_init_args(
            args, protocol or self._protocol, {**base_options, **storage_options}
        )
        if self._protocol != protocol and protocol:
            self._protocol = protocol
    
        # retrieve storage_options
        if args:
            args0 = args[0]
            if isinstance(args0, UPath):
                self._storage_options = {**args0.storage_options, **storage_options}
            else:
                if hasattr(args0, "__fspath__"):
                    _args0 = args0.__fspath__()
                else:
                    _args0 = str(args0)
                self._storage_options = type(self)._parse_storage_options(
                    _args0, protocol, storage_options
                )
        else:
            self._storage_options = storage_options.copy()
    
        # check that UPath subclasses in args are compatible
        # TODO:
        #   Future versions of UPath could verify that storage_options
        #   can be combined between UPath instances. Not sure if this
        #   is really necessary though. A warning might be enough...
        if not compatible_protocol(self._protocol, *args):
>           raise ValueError("can't combine incompatible UPath protocols")
E           ValueError: can't combine incompatible UPath protocols

../../../.venv/lib/python3.12/site-packages/upath/core.py:263: ValueError
========================================================= warnings summary =========================================================
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
bids/layout/tests/test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136]
  /local/scratch/pybids/.venv/lib/python3.12/site-packages/botocore/auth.py:424: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
    datetime_now = datetime.datetime.utcnow()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================== short test summary info ======================================================
FAILED test_remote_bids.py::test_layout_on_s3_datasets_no_derivatives[s3://openneuro.org/ds000102-136] - ValueError: can't combine incompatible UPath protocols
================================================== 1 failed, 8 warnings in 1.00s ===================================================

@effigies
Copy link
Collaborator

Okay, I'm starting to get a handle on the issue. Basically, although we can specify things like anon=True for the root path, there are many places where BIDSFiles are constructed without access to layout._root to ensure that these parameters get inherited. Because BIDSFile is an ORM object, the lack of round-trip equivalence in the serialization/deserialization process is a problem.

If universal_pathlib provides an API to access these additional parameters, we could consider adding a column to the BIDSFile model and use it while reconstructing. Another approach could be to mutate BIDSFiles after creation, provided the access happens through a BIDSLayout method call that has access to ._root. Alternately, if we could somehow put this information in the SQLAlchemy session, then the BIDSFile.path property could access it on demand.

@akhanf
Copy link
Contributor Author

akhanf commented Oct 31, 2024

Ah I see.

I wonder if we could use the storage_options dict for this purpose? This is something inherited from fsspec so can be used with universal_pathlib. The api within storage_options is specific to the underlying filesystem (eg anon is there for s3fs but not for others), but I hoping that shouldn't prevent it from being passed along and serialized with the object?

@effigies
Copy link
Collaborator

If you can figure out how to get storage_options to serialize and deserialize, I'm inclined to take any approach you care to use unless it causes performance degradation for local datasets.

I've run out of time to poke at this this week and am unlikely to get back to it very soon. Ping @adelavega in case you have ideas on how to tackle this.

@akhanf
Copy link
Contributor Author

akhanf commented Oct 31, 2024

Sounds good will give it a shot, thanks for helping out with this!!

If it turns out trickier than expected, wondering if a stopgap solution is just to document that custom options are not currently passed along. Doesn't resolve the anon s3 access issue but does resolve any other use of universal paths where custom options aren't needed. Eg I didn't even know there was an issue with anon s3 access because I had a credentials file set-up already.. anyways can consider that later if/when we get to that point..

@effigies
Copy link
Collaborator

Created #1098 to track the lesser bug. Let's merge this now.

@effigies effigies merged commit 88c32ef into bids-standard:master Nov 11, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants