[DOC] Warn about difficulty of pulling data from HPC #639
Thanks for raising this, Tom. The Pangeo Forge Pulls Data header is probably the most natural place to mention this. Perhaps an easy first step would be promoting that section. Maybe cross-linking it to the FAQs would also surface it better. (And possibly moving the FAQs into the Getting Started section makes sense as well.)
I did not see / appreciate that bit of the docs! It already covers a lot of what I had in mind. Maybe explicitly pointing out that HPC systems generally do not implement ways of accessing data over URLs would be an improvement?
Sounds good to me! PRs welcome! 😄 (The passage in question is pangeo-forge-recipes/docs/composition/file_patterns.md, lines 28 to 49 at 4aae78f.)
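For reference, the core idea of that docs section is that a recipe's inputs are just URLs resolved through fsspec, and fsspec also handles plain filesystem paths. A minimal sketch, assuming a hypothetical file layout on HPC scratch:

```python
# A minimal sketch: because pangeo-forge resolves inputs through fsspec,
# a FilePattern's "URLs" can be plain local paths handled by fsspec's
# LocalFileSystem. The directory layout below is made up.
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

months = ["2000-01", "2000-02", "2000-03"]  # hypothetical time coordinate

def make_path(time: str) -> str:
    return f"/glade/scratch/me/model_output/data_{time}.nc"  # made-up path

pattern = FilePattern(make_path, ConcatDim("time", months))
```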
@TomNicholas for your specific case, are you able to push data from your HPC to a GCS bucket? If so, the recipe could be written against the GCS cache.
Would this not benefit from the daskrunner? That would allow scaling out on HPC, reading from the local filesystem (which pangeo-forge already supports, because fsspec supports it), and then writing out either to the cloud (via individual user credentials) or back to the local filesystem.
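A rough, untested sketch of that idea, using the current Beam-based pangeo-forge-recipes transforms; `pattern` is the FilePattern from the sketch above, and the bucket and store names are hypothetical:

```python
# Sketch: read source files from the local HPC filesystem (fsspec handles
# bare paths) and write the ARCO Zarr store to a cloud bucket.
import apache_beam as beam
from pangeo_forge_recipes.transforms import (
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
)

with beam.Pipeline() as p:  # a DaskRunner would be passed here once usable
    (
        p
        | beam.Create(pattern.items())  # `pattern` as in the sketch above
        | OpenURLWithFSSpec()           # local paths resolve via LocalFileSystem
        | OpenWithXarray(file_type=pattern.file_type)
        | StoreToZarr(
            store_name="my-dataset.zarr",               # hypothetical name
            target_root="gs://my-bucket/arco",          # or a local scratch path
            combine_dims=pattern.combine_dim_keys,
        )
    )
```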
@cisaacstern I think so, but that might be really slow if I can only push to a login node. Also, the dataset needs to end up in an AWS bucket (AWS Open Data program), so am I likely to face large egress charges moving data from GCS to AWS?
Hopefully! Although I need to find out if the NCAR compute nodes can actually write out to the public cloud, otherwise I can't scale out because I would be limited to writing from a login node.
I think a larger question here is 'how cloud-specific is pangeo-forge?'. The primary reason I got involved in the project (and started working on https://github.com/pangeo-forge/pangeo-forge-runner) is that I want it to be neither cloud-provider-specific nor even require clouds. This is why I pushed to move all 'submission' and 'status' code out of pangeo-forge-orchestrator (which was cloud- and public-GitHub-specific) into runner. So while I agree that currently it's still tied to the cloud, I think with runner it need not be. Especially with a daskrunner, I think it can work just as well on HPC systems as it does in the cloud, although the specifics of how it is configured have to be different.

One primary question is 'what is the equivalent of object storage on HPC systems?'. I think if you're running on HPC systems that don't have any object storage deployed, the closest equivalent is probably whatever 'fast' scratch setup they have (something like Lustre mounted over NFS, perhaps). It'll be slower than running on a cloud provider with object storage, but faster than reaching out to S3 from the HPC system.

So the pattern for running this on HPC would be that the source data is on the fast local filesystem, the destination data should also be on a fast local filesystem, and then after you are done you can publish out to the public cloud for external use. I think the 'final publishing' use case of public cloud object storage should be treated differently from the 'intermediate output' use case. If the job is running on the cloud, the 'intermediate output' location and the 'final publishing' location can be the same; on HPC systems I think they would be different.

All that said, until the DaskRunner lands in Beam properly, on HPC systems you'd be limited to non-scale-out performance anyway. But I do believe that means that for meaningful HPC use, the best way to push this forward is to get the daskrunner to completion.
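For concreteness, a hypothetical pangeo-forge-runner configuration along these lines might point both the input cache and the target at local scratch, with cloud publication happening later, outside pangeo-forge. The option names follow pangeo-forge-runner's traitlets-based storage configuration (check its docs for the current names); the paths are made up:

```python
# Hypothetical runner config for an all-local HPC run: intermediate cache and
# target both live on fast scratch; publishing to the cloud happens afterwards.
c.TargetStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
c.TargetStorage.root_path = "/glade/scratch/me/pangeo-forge/target"

c.InputCacheStorage.fsspec_class = "fsspec.implementations.local.LocalFileSystem"
c.InputCacheStorage.root_path = "/glade/scratch/me/pangeo-forge/cache"
```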
Potentially? I think it would depend on the specifics, but it's worth being cautious about.
As @yuvipanda helpfully observes, in a future DaskRunner world, it may very well make sense for Pangeo Forge work to happen entirely within the HPC storage context, with the final "publish" step happening as a subsequent forklifting of the pre-built ARCO data from the HPC filesystem to the cloud bucket. In this case, the question becomes, how does anyone working on this NCAR HPC ever efficiently move data to the cloud?
Thanks for this thoughtful reflection @yuvipanda. I agree. And this resonates with what @rabernat and I discussed yesterday: namely, that the DaskRunner represents a very important (possibly indispensable) on-ramp for the scientific community. My evolving understanding is that the DaskRunner as currently released in Beam implements only …
Thanks for the comments, @yuvipanda!
They have Globus, it turns out, which hopefully I can use. Happy to contribute to integrating this with pangeo-forge, as @jbusecke suggested to me yesterday. I am expecting to need to do several more of these kinds of data-moving tasks, from HPC to cloud.
So basically I run the recipe on HPC, doing any data transformation to a temporary intermediate state on the HPC system itself (hopefully in parallel by using the daskrunner), then at the end write the result out to the cloud (AWS or Google) using Globus? How does this fit with the "pangeo-forge only pulls data" idea?
> So basically I run the recipe on HPC, doing any data transformation to a temporary intermediate state on the HPC system itself (hopefully in parallel by using the daskrunner), then at the end write the result out to the cloud (AWS or Google) using Globus?

Correct. To be more explicit, pangeo-forge puts the end ARCO result on the HPC system itself, and then after that you can use any system to move it to the cloud.
I don't actually know what this means! Can you explain this a little more?
I was referring to this passage in the docs: https://pangeo-forge.readthedocs.io/en/latest/composition/file_patterns.html#pangeo-forge-pulls-data
@TomNicholas in this case your Pangeo Forge pipeline will be pulling data from the HPC filesystem.
Correct. There will be no external cloud involvement at all from the pangeo-forge perspective. The compute is dask, and it's pulling data from the HPC system's filesystems and putting data back there. The final 'push' to cloud object storage doesn't involve pangeo-forge at all (although we should provide documentation on how to do it).
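As a starting point for that documentation, here is a minimal sketch of the out-of-band publish step using the Globus Python SDK. The endpoint UUIDs, token, and paths are all hypothetical, and the token would come from one of globus_sdk's auth flows:

```python
# A minimal sketch of the final "publish" step with globus_sdk.
import globus_sdk

SRC_ENDPOINT = "..."   # hypothetical UUID of the HPC Globus collection
DST_ENDPOINT = "..."   # hypothetical UUID of a cloud-facing (e.g. S3) collection
TRANSFER_TOKEN = "..."  # placeholder; obtain via a globus_sdk auth flow

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)
tdata = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT, label="publish ARCO store"
)
# Recursively transfer the pre-built Zarr store to its public home.
tdata.add_item(
    "/glade/scratch/me/my-dataset.zarr",  # made-up source path on HPC
    "/arco/my-dataset.zarr",              # made-up destination path
    recursive=True,
)
task = tc.submit_transfer(tdata)
print("Globus transfer task:", task["task_id"])
```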
Hey folks, super useful discussion. I want to throw another use case into the mix here: while I appreciate the push towards making this work from 'within' the HPC (and I very much agree with the need for the daskrunner!), I think the Globus route offers a potentially much more generalizable method of ingestion from HPC centers in the short term. Given that I will not be able to 'upload from within', working on the alternative workflow below seems like a better opportunity for collaboration between folks here.

Proposed workflow
The one awkward part here is the need for someone to actually 'put' the files into a Globus collection, which kinda violates the principle of PGF, but I think it can just be seen as a 'flaky' data source. The ease of use and the unified (Globus) API for this IMO outweigh that. I want to stress that I think this is complementary to the above approach (which we also need for using Beam beyond data ingestion). I think ultimately we should implement both, but for mostly selfish reasons 😁, I would prefer the 'outside of HPC' method.
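To illustrate the ingestion side of this idea: if the HPC Globus collection exposes files over HTTPS (as Globus Connect Server v5 guest collections can), the existing "pulls data" model applies unchanged. The URL below is made up:

```python
# Hypothetical sketch: treat an HTTPS-enabled Globus guest collection as just
# another URL-based data source for a FilePattern.
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern

def make_url(time: str) -> str:
    # Made-up HTTPS URL for a file staged into a Globus guest collection.
    return f"https://g-abc123.data.globus.org/staging/data_{time}.nc"

pattern = FilePattern(make_url, ConcatDim("time", ["2000-01", "2000-02"]))
```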
But this could also be a pangeo-forge stage, right? Even if it is just a single-worker dummy stage uploading the data, I think for reproducibility reasons it would be good to include this in the pipeline.
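A hypothetical sketch of that idea: a trivial single-element Beam stage appended to the recipe that submits the Globus transfer, so the upload is captured in the pipeline itself. The helper name and paths are made up, and the Globus call would follow the globus_sdk sketch above:

```python
# Hypothetical: record the final upload as a (single-element) pipeline stage.
import apache_beam as beam

def submit_globus_transfer(_unused, src_path: str, dst_path: str) -> str:
    # Placeholder: build and submit a globus_sdk TransferData here
    # (see the earlier sketch) and return the task id for provenance.
    raise NotImplementedError

publish = (
    p  # the same beam.Pipeline object the recipe runs in
    | "single element" >> beam.Create([None])
    | "publish via Globus" >> beam.Map(
        submit_globus_transfer,
        src_path="/glade/scratch/me/my-dataset.zarr",  # made-up paths
        dst_path="/arco/my-dataset.zarr",
    )
)
```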
I just had a long and useful chat with @jbusecke, who corrected several misconceptions I had about what pangeo-forge is and how to use it. One misconception in particular was that I assumed it would be relatively easy to pull data from an HPC system into Dataflow. I now understand that this is definitely not the case, and I will be lucky if NCAR supports uploading data via Globus or even an FTP server. 🥲
I think this is important context for understanding what pangeo-forge can and can't do, as I think many users will be in the same position as me: "I have a simulation dataset sitting on HPC, and I want to turn it into ARCO Zarr data in the cloud." It was not at all obvious to me that the main intended use case for pangeo-forge is pulling data that is already publicly available.
Could we find some way to document this better in the pangeo-forge-recipes documentation, perhaps along with some current recommendations for what to do in this situation? I understand that this is not yet a solved problem.