
Optimize Error Handling and Regex Caching in Tensor Loading #221

Open
wants to merge 4 commits into base: main
Conversation

Madhav-MKNC

This PR introduces two key enhancements to the tensor loading process (fixes #220):

  • Improved error handling within ThreadPoolExecutor to provide detailed logs for failures during parallel tensor loading.
  • Implementation of caching for regex operations in get_load_path_str to reduce computational overhead and improve loading efficiency.

These changes aim to enhance the robustness and performance of tensor loading, particularly in distributed computing environments.
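A minimal sketch of the two changes described above, assuming a regex-based `get_load_path_str` and a `ThreadPoolExecutor`-driven loader. The function bodies and names other than `get_load_path_str` (e.g. `load_tensors`, `load_fn`, `load_rename_rules`) are hypothetical stand-ins, not the repository's actual code:

```python
import logging
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import lru_cache

logger = logging.getLogger(__name__)

@lru_cache(maxsize=1024)  # avoid recomputing regex matches for repeated paths
def get_load_path_str(checkpoint_path: str, load_rename_rules=None):
    # load_rename_rules is a tuple of (pattern, replacement) pairs so the
    # arguments stay hashable, which lru_cache requires.
    if load_rename_rules:
        for pattern, replacement in load_rename_rules:
            new_path = re.sub(pattern, replacement, checkpoint_path)
            if new_path != checkpoint_path:
                return new_path
    return checkpoint_path

def load_tensors(paths, load_fn, num_workers=8):
    """Load tensors in parallel, logging which path failed before re-raising."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = {pool.submit(load_fn, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception:
                # Without this, a worker failure surfaces as a bare traceback
                # with no indication of which tensor was being loaded.
                logger.exception("Failed to load tensor from %s", path)
                raise
    return results
```

Note that the rename rules are passed as a tuple rather than a list precisely so `lru_cache` can hash the call arguments.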

checkpoint.py (review thread, outdated; resolved)
@Aareon

Aareon commented Mar 22, 2024

Hey! I'm happy to see a limit set for the memory usage in LRU Cache. Would it be possible for you to roll a test for this change? I think a pytest test would suffice.
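A pytest test along these lines could exercise the caching behavior. The stand-in below re-declares a cached helper rather than importing from `checkpoint.py`, since the module's import path and final signature are assumptions:

```python
# test_checkpoint_cache.py -- hypothetical test sketch; the helper below is a
# stand-in for the lru_cache-decorated get_load_path_str in checkpoint.py.
import functools
import re

@functools.lru_cache(maxsize=1024)
def get_load_path_str(path: str, rules: tuple) -> str:
    for pattern, replacement in rules:
        path = re.sub(pattern, replacement, path)
    return path

def test_cache_hits_on_repeated_paths():
    get_load_path_str.cache_clear()
    rules = ((r"layer_(\d+)", r"block_\1"),)
    first = get_load_path_str("model/layer_3/w", rules)
    second = get_load_path_str("model/layer_3/w", rules)
    assert first == second == "model/block_3/w"
    info = get_load_path_str.cache_info()
    # One miss on the first call, one hit on the identical second call.
    assert info.hits == 1 and info.misses == 1
```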

@Aareon

Aareon commented Mar 22, 2024

Perhaps path_tuple_to_string could also utilize the LRU cache. Just a thought.

"""
For path_tuple_to_string(),
introducing a simple caching mechanism to avoid recomputing regex matches for paths that have already been processed.
"""

Is this docstring supposed to be here, or inside the path_tuple_to_string method?

@Aareon

Aareon commented Mar 27, 2024

After delving a bit, I now think 32 MB for the LRU cache may be overkill. I don't think more than 250 KB at most should be necessary.
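One wrinkle with a byte budget like 250 KB: `functools.lru_cache` caps the number of entries, not the bytes they occupy, so the budget has to be translated into an entry count. A rough sketch, where the sample key used to estimate entry size is an assumption:

```python
import sys
from functools import lru_cache

# Translate a ~250 KB budget into an lru_cache entry count.
BUDGET_BYTES = 250 * 1024
# Estimate per-entry cost from a representative path string; the cache holds
# roughly one key string plus one value string per entry, hence the factor 2.
AVG_ENTRY_BYTES = sys.getsizeof("checkpoint/layer_00/tensor_000")
MAXSIZE = max(1, BUDGET_BYTES // (2 * AVG_ENTRY_BYTES))

@lru_cache(maxsize=MAXSIZE)
def cached_rename(path: str) -> str:
    # Placeholder transform standing in for the real regex rewrite.
    return path.replace("layer", "block")
```

This is only an order-of-magnitude estimate; `sys.getsizeof` does not count the dict machinery inside the cache itself, so the true footprint is somewhat larger.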

@RaphaelFakhri

RaphaelFakhri commented Mar 30, 2024

"delving" hahahaha (for reference: https://x.com/JeremyNguyenPhD/status/1774021645709295840)

@Aareon

Aareon commented Apr 5, 2024

> "delving" hahahaha (for reference: https://x.com/JeremyNguyenPhD/status/1774021645709295840)

Hadn't seen this, and the stat doesn't really correspond with anything related to ChatGPT. I most certainly didn't need to use ChatGPT to come to that conclusion.

I recommend trying both ways and running benchmarks to see which provides the best performance improvement.

Successfully merging this pull request may close these issues.

Enhancements for Error Handling and Regex Operation Optimization in Distributed Tensor Loading
3 participants