-
Notifications
You must be signed in to change notification settings - Fork 864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coll tuned dynamic rules file alltoall_algorithm_max_requests #12827
coll tuned dynamic rules file alltoall_algorithm_max_requests #12827
Conversation
@janjust @jsquyres could one of you give a review? or advise on who could? I'm reaching out to you as you are tagged on the issue (#12589). for context. On our cluster OpenMPI's fixed decision infrastructure does not choose good transitions between all-to-all algorithms. I found that OpenMPI's About this patch: I was not sure if a backward incompatible change to the rules file format would be accepted. That is why I made the new parameter optional. Old rules files will work unmodified. However, I could see pros and cons to both ways. Would be happy to revise to address any concerns. |
b1767b2
to
7b505b2
Compare
no new changes. I rebased from main so that this patch doesn't fall too far behind. Here's rules file that I used to test on TACC Vista's Grace-Hopper partition. Some output
and again for a different message size
All seems to be working. I also tested that when max requests is not specified in the rules file by removing it from every other line in the rules file. I verified that the command line setting |
failed test in CI run above is not related to the patch but rather one of the CI systems being offline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern here is what is happening with an older version (let's say based on the 4.x) reading a configuration file that does contain this extra value ? Will it just ignore everything until the EOL or will it consider it incorrect and stop completely reading the configuration file?
Wenduo and I discussed an idea to re-implement a tuning file format in json, especially now that we have an opal/util/json. Our motivation was that for allreduce we have a different algorithm we want to use when the communicator is "disjoint" vs not, but we could not find a way to indicate this in the tuning file without breaking the existing format. However I haven't had time to research it enough to come up with even a strong suggestion for a json format. I do believe if we could agree on a new format, then the implementation to read a new json file could be complete in only a few days. |
7b505b2
to
5437cac
Compare
I'd like this patch to be applied to both the 4.x and 5.x branches because I'm using both of these depending on what the admins I'm running on have decided. Most importantly, a config file that works with the older release of OpenMPI will still work going forward. Your point is that a new config file with the extra value will not work with an older release of OpenMPI. That's a limitation of the parser. I can't go back in time and rewrite it to be version aware. However, the release version below which the feature does not work could be documented in the user guide. Would this resolve your concern? I would also be willing to make the parser version aware. It's a relatively easy change that would prevent this issue in the future. |
I find Yaml to be a bit easier to write than Json. It's personal preference though. Either way it would be a nice improvement. |
I went though the selection code and if my reading (and recollection) of that code is correct we have a problem. The loop reading the different rules expects 4 longs per rules (message size, algorithm, fanin and segment size). Assuming we have 2 correct rules from an old configuration file here are the tokens that will exists in the configuration file:
If we try to read this configuration file with the code from this PR, we will get the wrong output because the function to read the next token ( We will have a similar issue if we are reading a new file with the old parser. Here I added a max request at the end of each message rule.
The old parser will read 3 as the message size for the next rule instead of simply ignoring it. |
You're wrong about that here's why. |
I found one way to solve this, and have added a commit to the PR with the fix (2da8e6d). This adds support in the parser for a version identifier: |
here is a simple working example of the rule file that can be used for testing.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you initialized version
but I don't see it used in the code. If you protect the use of isnext_digit
with a check for the version is should work better.
2da8e6d
to
7884ae8
Compare
@burlen all looks good. Can you please add or update the copyright in the files you touched. Something like this:
|
7884ae8
to
f302969
Compare
Teach the dynamic rules file reader to look for the alltoall_algorithm_max_requests tuning parameter. To keep the dynamic rules file format backward compatible the alltoall_algorithm_max_requests is optional. When not present in the rule definition the value of the corresponding MCA variable is used instead. Resolves open-mpi#12589 Signed-off-by: Burlen Loring <[email protected]>
the version identifier is optional but when provided it must have the following format and must appear on the first line.`rule-file-version-N` where N is an unsigned integer. Older versions of the parser will fall back to fixed decision mechanism when this line is present. Version 1 is the original format, Version 2 has support for optional coll_tuned_alltoall_algorithm_max_requests specification. Signed-off-by: Burlen Loring <[email protected]>
f302969
to
f6387a4
Compare
updated copyright and rebased from main |
Teach the coll tuned dynamic rules file reader to look for the
alltoall_algorithm_max_requests
tuning parameter. To keep the dynamic rules file format backward compatible the presence of thealltoall_algorithm_max_requests
is optional. When not present in the rule definition the value of the corresponding MCA variable is used instead.Resolves #12589