Few trivial changes we've noted #5

Open
c200chromebook opened this issue May 2, 2022 · 1 comment
c200chromebook commented May 2, 2022

  1. Consider a timeout while nodes are joining, e.g.:
timeout=0
while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
do
  timeout=$((timeout + 1))
  if [ "$timeout" -gt 240 ]; then
    echo "Not all nodes joined within 4 minutes. Terminating. Recommend rerun."
    exit 1
  fi
  log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, will check again in 1 second"
  sleep 1
  lines=$(uniq "$HOST_FILE_PATH" | wc -l)
done

Should a node fail during startup, the master and the other workers will spin until the overall job timeout kills the job. Limiting the join time gets rid of the stuck job sooner.
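
For reference, the same idea can be wrapped in a small function and tried outside Batch. This is only a sketch: `wait_for_nodes` and its arguments are names made up here, and `sort -u` is swapped in for `uniq` so that duplicate lines in the host file are collapsed even when they are not adjacent.

```shell
#!/bin/bash
# Sketch only: wait_for_nodes is a hypothetical helper, not part of the project.
# Usage: wait_for_nodes <host_file> <expected_nodes> <max_tries> [poll_seconds]
wait_for_nodes() {
  local host_file=$1 expected=$2 max_tries=$3 poll=${4:-1}
  local tries=0 joined
  while :; do
    # Count distinct host entries; a missing file counts as zero.
    joined=$(( $(sort -u "$host_file" 2>/dev/null | wc -l) ))
    if [ "$joined" -ge "$expected" ]; then
      return 0
    fi
    tries=$((tries + 1))
    if [ "$tries" -gt "$max_tries" ]; then
      echo "Only $joined of $expected nodes joined; giving up." >&2
      return 1
    fi
    sleep "$poll"
  done
}
```

In the entrypoint this might be called as `wait_for_nodes "$HOST_FILE_PATH" "$AWS_BATCH_JOB_NUM_NODES" 240 || exit 1`, which keeps the 4-minute limit from the loop above.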

  2. For TCP, you'll want an appropriate set of flags. The last one is key: without it, the MPI network can get a packet from an IP it doesn't expect, which causes all kinds of problems: --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0

  3. A small modification so that the job status reflects whatever the application returned:

  <user's logic>
  RESULT_CODE=$?
  sleep 2
  log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
  echo "$RESULT_CODE" > "$AWS_BATCH_EXIT_CODE_FILE"
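
Combined with the TCP flags above, the exit-code handling could look roughly like this. `run_and_record` is a name invented for this sketch, and the `mpirun` line in the usage comment assumes a host file and an application path that are not given in this thread.

```shell
#!/bin/bash
# Sketch only: run_and_record is a hypothetical helper.
# Runs the given command, writes its exit status to the given file,
# and returns that same status to the caller.
run_and_record() {
  local exit_code_file=$1
  shift
  local result=0
  "$@" || result=$?
  echo "$result" > "$exit_code_file"
  return "$result"
}

# Example call (the application path is a placeholder, not from the project):
# run_and_record "$AWS_BATCH_EXIT_CODE_FILE" \
#   mpirun --hostfile "$HOST_FILE_PATH" \
#     --mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0 \
#     /path/to/app
```

Capturing the status with `|| result=$?` keeps the script alive even under `set -e`, so the exit code file is always written before the node shuts down.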

c200chromebook (Author) commented

Oh yes! There's one other thing. I think once the master terminates, the other nodes hang on for 30 seconds, likely because they are sent a SIGTERM and then, when they don't exit, a SIGKILL. It would be nice to remove those 30s at some point, though obviously it's not critical.
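
If the delay does come from a SIGTERM that the worker's shell never acts on, one common approach is to trap the signal and exit right away. Whether that applies here depends on how the container's processes are supervised, which this thread doesn't settle; `worker_idle_loop` below is a made-up name and purely illustrative.

```shell
#!/bin/bash
# Sketch only: worker_idle_loop is a hypothetical stand-in for whatever
# keeps a worker node alive while the main node runs the job.
worker_idle_loop() {
  # Exit promptly on SIGTERM instead of waiting to be SIGKILLed.
  trap 'echo "SIGTERM received, shutting down"; exit 0' TERM
  while :; do
    # Sleep in the background and wait on it: bash runs a trap as soon as
    # `wait` is interrupted, but only after a foreground `sleep` finishes.
    sleep 1 &
    wait $!
  done
}
```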
