04 Call Digest #4

jules32 · 2024-05-16T21:28:00Z

Hi All,

Thanks to NASA Openscapes Mentors Bri Lind and Mahsa Jami from LP DAAC, and Cassie Nickles from PO.DAAC, for teaching this week! It was great to hear status updates for the Cloud from Justin Rice, Deputy Project Manager/Data Systems for the ESDIS Project Office at NASA Goddard Space Flight Center. Together we covered open communities and coding strategies to leverage the power of the cloud through parallelization. Below is a light digest of Call 4.

Have a great week!
Julie, Erin, Andy, Liz, and the very awesome NASA Openscapes Mentors 🚀

Digest: Cohort Call 04 [ 2024-nasa-champions ]
Openscapes_CohortCalls [ 2024-nasa-champions ] Google folder - contains agendas, recordings, pathways
https://openscapes.github.io/2024-nasa-champions - cohort webpage

Goals: We’ll discuss open communities and coding strategies for future us in the Cloud. Additionally, a NASA Earthdata Cloud Update from special guest Justin Rice, NASA Goddard Space Flight Center, ESDIS Project Office, Deputy Project Manager/Data Systems

Task: Have a Seaside Chat and prepare your Pathways presentation (more details in our agenda)

Prepare to present your Pathway work-in-progress on our final call - Each group has 5 minutes to share their pathway: (3 min present + 2 min Questions)
Coworking (optional): May 23, 10-11:30 PT
- Come work on your Pathways, or run code in the JupyterHub together, ask questions.
- A chance to work on your own things socially & ask questions/ screenshare. We share what we’re going to work on, and then work quietly and then check in at the end as well. We also make breakout rooms for Q&A if folks want to screenshare and talk things out.

Slide Decks:

Open communities (slides)
NASA Earthdata Cloud Cookbook (cookbook)
Earth Science Data & Information System (ESDIS) Update (slides)
Coding strategies for Future Us (slides)

More open communities!!! 🥰

Center for Scientific Collaboration and Community Engagement (CSCCE): https://www.cscce.org/
Cloud Native Computing Foundation (CNCF): https://www.cncf.io/
US Research Software Engineering (US-RSE): https://us-rse.org/ +1
Research Data Alliance (RDA): https://www.rd-alliance.org/
Earth Science Information Partners (ESIP): https://esipfed.org/
NASA Earth Science Data System Working Groups (ESDSWG)
CryoCloud: https://book.cryointhecloud.com/intro.html
Project Pythia: https://projectpythia.org/dask-cookbook/

A few lines from shared notes in the Agenda doc:

Folks aren’t connecting on Twitter anymore, where are folks these days?
Every community I’m a part of is struggling with this! One article making the rounds is https://joanwestenberg.com/blog/breaking-up-with-slack-and-discord-why-its-time-to-bring-back-forums
Hard to have time for open communities when so much is going on, but really fantastic to have a space to troubleshoot and get different perspectives
"pleasingly parallel" - tasks that are completely independent from each other. For example, to validate whether each value in a dataset is within a threshold.
A lot of parallel computing presupposes access to cloud computing or HPC resources, which cost money. It would be nice to have some talk about using the multiple cores within a laptop
- +1
This has all been very helpful. Not only are the tools and training great, but NMFS is embarking on large programs (e.g., CEFI) that could really use these approaches. And, given that we’re about done, how about we make these meetings sort of permanent events?! Everybody in?!
How do you get started knowing what kind of computing resources you actually need? Many of my datasets are not that large, so I typically operate with the mantra that it does not matter how I do stuff because it just works. But this will break down at some point, and I would like to know how to determine the resources that I am using.
- In my experience, running the workflow on a few sample files and tracking the memory allocation and memory time series from the dashboard can give you an idea
Having Justin emphasize subsetting / OPeNDAP in the cloud right before Mahsa explained chunking and parallels … I think it’s all coming together for me now. Subsetting isn’t just important when you only need a small slice of the data; subsetting when reading directly from S3 is critical for chunks to independently load data, even when you’re going to process the whole array. Is that right?
- From Luis Lopez: With this small edit, the sentence is totally correct:
- Having Justin emphasize subsetting / OPeNDAP in the cloud right before Mahsa explained chunking and parallels … I think it’s all coming together for me now. Chunking isn’t just important when you only need a small slice of the data; chunking when reading directly from S3 is critical for chunks to independently load data, even when you’re going to process the whole array.
How do you rethink your code so it can be parallel?
- Inspect your for-loops or nested for-loops: what is the goal, and how could you identify the parts that are
Resources shared from NOAA Enterprise Data Management Workshop
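
A few of the notes above (the "pleasingly parallel" threshold check, using the multiple cores in a laptop, and sizing resources from a small sample run) can be sketched with only the Python standard library. This is a minimal illustration, not the approach taught on the call: the function names, the `[low, high]` threshold convention, and the use of plain lists are all assumptions made for the example. It also uses `tracemalloc` in place of the dashboard mentioned in the notes, and `tracemalloc` only reports memory allocated through Python.

```python
import os
import tracemalloc
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous, independent pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def count_out_of_range(chunk, low, high):
    """Count values outside [low, high]; needs nothing from other chunks."""
    return sum(1 for v in chunk if not (low <= v <= high))


def count_out_of_range_serial(values, low, high):
    """The familiar for-loop shape: one value at a time, on one core."""
    bad = 0
    for v in values:
        if not (low <= v <= high):
            bad += 1
    return bad


def count_out_of_range_parallel(values, low, high, workers=None):
    """Same result, with each chunk validated in a separate local process."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(values, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each chunk is checked independently, so the order of work is irrelevant.
        counts = pool.map(count_out_of_range, chunks, repeat(low), repeat(high))
        return sum(counts)


def peak_memory_bytes(func, *args):
    """Run func(*args) on a small sample and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

When calling `count_out_of_range_parallel` from a script, place the call under `if __name__ == "__main__":` so that worker processes started as fresh interpreters do not re-run it on import. For small inputs the serial loop will usually be faster, since starting processes and copying chunks has its own cost. Timing both versions and running `peak_memory_bytes` on a sample is one way to judge whether parallelizing is worth it.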

jules32 mentioned this issue May 16, 2024

After Cohort Call 4 [ 2024-nasa-champions ] NASA-Openscapes/cohort-planning#14

Closed

4 tasks
