Issues: intelligent-machine-learning/dlrover
Issues list
Adapting DLRover to a new accelerator type and implementing a script similar to Nvidia_gpu.py
question
Further information is requested
#1338
opened Nov 15, 2024 by
lulu-0126
AttributeError: module 'collections' has no attribute 'Sequence'
investigating
#1332
opened Nov 12, 2024 by
linzhidao1010
Can DLRover be applied to diffusion transformer training, and can it be combined with DeepSpeed?
question
Further information is requested
#1314
opened Oct 29, 2024 by
TomSuen
Add balance loss in atorch moe example
Hacktoberfest
todo
issue or pr with 'todo' will ignore expiration
#1300
opened Oct 18, 2024 by
skydoorkai
How does DLRover make sure all the nodes in one job are under one switch?
question
Further information is requested
#1298
opened Oct 17, 2024 by
gangxie112
add xpu monitor for dlrover
Hacktoberfest
todo
issue or pr with 'todo' will ignore expiration
#1290
opened Oct 12, 2024 by
majieyue
Question: How does DLRover integrate with Llama Factory?
question
Further information is requested
#1244
opened Aug 21, 2024 by
hetingyou
What is the relationship between DLRover and Megatron? Can I integrate DLRover with Megatron to get fault-tolerance and monitoring capabilities? How can DLRover recover from GPU offline problems when TP and PP need to be reorganized?
question
Further information is requested
#1243
opened Aug 19, 2024 by
dotsonliu
xpu timer python package
todo
issue or pr with 'todo' will ignore expiration
#1159
opened Jun 17, 2024 by
zxyyzx
Megatron-LM flash-ckpt cannot save ckpt to disk when using pipeline parallelism
help wanted
Extra attention is needed
investigating
The job stops restarting workers and exits if the traceback is a code bug.
enhancement
New feature or request
question
Further information is requested
todo
issue or pr with 'todo' will ignore expiration