timeout=0
while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
do
  timeout=$((timeout + 1))
  if [ "$timeout" -gt 240 ]; then
    echo "Not all nodes joined within 4 minutes. Terminating. Recommend rerun."
    exit 1
  fi
  log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, will check again in 1 second"
  sleep 1
  lines=$(uniq "$HOST_FILE_PATH" | wc -l)
done
If a node fails during startup, the master and the other workers spin until the overall job timeout kills the job. Limiting the join time, as in the loop above, gets rid of the stuck job much sooner.
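The same idea can be packaged as a function with a configurable limit. This is only a sketch: `wait_for_nodes` and its arguments are hypothetical names, not part of the original script, and it assumes each node appends its address to the host file as it joins. It uses a wall-clock deadline rather than counting loop iterations.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script).
# wait_for_nodes EXPECTED HOST_FILE [TIMEOUT_SECONDS]
# Returns 0 once HOST_FILE lists at least EXPECTED distinct hosts,
# or 1 if TIMEOUT_SECONDS (default 240) pass first.
wait_for_nodes() {
    local expected=$1 host_file=$2 timeout=${3:-240}
    local deadline=$(( $(date +%s) + timeout ))
    local joined=0
    while :; do
        # Count distinct entries; a missing file counts as zero hosts.
        if [ -f "$host_file" ]; then
            joined=$(sort -u "$host_file" | wc -l | tr -d '[:space:]')
        fi
        if [ "$joined" -ge "$expected" ]; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 1
    done
}
```

It could then be called as `wait_for_nodes "$AWS_BATCH_JOB_NUM_NODES" "$HOST_FILE_PATH" 240 || exit 1`, so a node that never joins fails the job after four minutes instead of at the overall timeout.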
For TCP, you'll want an appropriate set of flags. The last one is key: without it, MPI can get a packet from an IP it doesn't expect, which causes all kinds of problems:
--mca pml ob1 --mca btl tcp,self --mca btl_tcp_if_include eth0
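For context, here is roughly how those flags might sit in a full Open MPI command line. This is a sketch under assumptions: `build_mpirun_cmd` is a hypothetical helper, and the process count, host file path and program name are placeholders. Only the three `--mca` settings come from the suggestion above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints an mpirun command line for a TCP-only run.
# Arguments: process count, host file, network interface, then the program
# and its arguments. It only prints the command; it does not run mpirun.
build_mpirun_cmd() {
    local np=$1 host_file=$2 iface=$3
    shift 3
    echo mpirun -np "$np" --hostfile "$host_file" \
        --mca pml ob1 --mca btl tcp,self \
        --mca btl_tcp_if_include "$iface" "$@"
}
```

Calling `build_mpirun_cmd 4 "$HOST_FILE_PATH" eth0 ./my_mpi_app` prints the command so it can be logged before it is run. The interface name `eth0` is what the suggestion above uses; check it against the actual container network.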
A small modification so that the job status reflects whatever the application returned:
<user's logic>
RESULT_CODE=$?
sleep 2
log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
echo "$RESULT_CODE" > "$AWS_BATCH_EXIT_CODE_FILE"
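The same pattern can be wrapped in a function so the exit code is captured immediately after the application finishes, before any other command overwrites `$?`. This is a minimal sketch: `run_and_record` is a hypothetical name, and the exit-code file is passed in as an argument.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: runs the given command, writes its exit status to
# the file named by the first argument, and returns that same status.
run_and_record() {
    local exit_code_file=$1
    shift
    "$@"
    local result=$?
    echo "$result" > "$exit_code_file"
    return "$result"
}
```

For example, `run_and_record "$AWS_BATCH_EXIT_CODE_FILE" ./my_mpi_app` (with `./my_mpi_app` standing in for the user's logic) leaves the application's status in the file and also returns it to the caller.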
Oh yes! There's one other thing. I think once the master terminates, the other nodes hang on for 30 seconds, likely because they are sent a SIGTERM and then a SIGKILL. It would be nice to remove those 30 seconds at some point, though it's obviously not critical.