A benchmark typically mixes heavy computations, where each module instance takes, for example, a few minutes, with light computations, where each instance takes a fraction of a second. Currently we have a mechanism to specify this, so that heavy computations are submitted as jobs on the cluster while lighter ones run directly on the node from which jobs are submitted.
The limitation is that the smaller jobs still have to run on a single node, e.g. the login node, and there is limited control over the resources they use, e.g. number of CPU threads, memory (there is at least some control over memory) and walltime. It is not good practice to run computations on a login node anyway. A possible way out would be to parse the benchmark and use a dedicated compute node for these light jobs, where resource usage is still under control but there is no per-job queueing, thus avoiding most of the interaction (overhead) with the queue system.
@pcarbo true. My proposed solution is essentially an extension of it: reserve multiple compute nodes, not just one, and run these jobs on them. The difference from submitting jobs is that a fixed number of compute nodes is reserved up front for light jobs throughout the entire DSC, whereas currently each module has to reserve nodes, run its jobs, and give up the reservation before other modules come in to reserve new nodes, which has higher overhead.
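To make the "reserve once, reuse for every module" idea concrete, here is a minimal sketch. It is not DSC code: the threads of a `ThreadPoolExecutor` merely stand in for reserved compute nodes, and `light_task`, `run_benchmark`, and the `modules` dictionary are hypothetical names invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def light_task(x):
    # Hypothetical stand-in for a module instance that takes a fraction of a second.
    return x * x


def run_benchmark(modules, n_workers=4):
    """Run the light instances of every module on one pool kept for the whole run."""
    results = {}
    # The pool is created once (the up-front "reservation") and shared by all
    # modules, rather than being set up and torn down separately for each module.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for name, inputs in modules.items():
            # pool.map returns results in the same order as the inputs.
            results[name] = list(pool.map(light_task, inputs))
    return results


# Hypothetical benchmark: two modules, each with a few light instances.
modules = {"simulate": [1, 2, 3], "analyze": [4, 5]}
results = run_benchmark(modules)
```

Here `n_workers` plays the role of the fixed number of reserved nodes. A real implementation would also need to enforce per-job memory and walltime limits, which this sketch leaves out.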