By default, a Balsam Site does not use your computing allocation. Instead, we must use the `BatchJob` API or the `balsam queue` CLI to explicitly request compute nodes at a given Balsam Site. This follows the principle of least surprise: as the user, you decide explicitly when and how resources are spent. This is not a limitation, since we can remotely submit `BatchJobs` to any Site from any computer where Balsam is installed!
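For instance, a remote submission through the Python API might look like the sketch below. The `BatchJob` fields mirror the elastic queue settings discussed later on this page, but the Site name `"my-site"` is a placeholder, and the exact Site lookup may differ depending on how your Sites are registered:

```python
from balsam.api import BatchJob, Site

# Look up the target Site ("my-site" is a placeholder name).
site = Site.objects.get(name="my-site")

# Request 2 nodes for 60 minutes on the "debug" queue, charged to the
# "datascience" project, running launchers in MPI mode.
BatchJob.objects.create(
    site_id=site.id,
    num_nodes=2,
    wall_time_min=60,
    queue="debug",
    project="datascience",
    job_mode="mpi",
)
```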
However, we can opt in to the `elastic_queue` plugin, which creates `BatchJob` submissions on our behalf. This can be a useful service for bursty or real-time workloads: instead of micro-managing the queues, we simply submit a stream of `Jobs` and allow the Site to provision resources as needed over time.
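As a rough sketch of that usage pattern, we might create a burst of `Jobs` through the Python API and let the elastic queue handle the corresponding `BatchJob` submissions. The App name, working directories, and parameters below are hypothetical placeholders, and the exact `Job` constructor fields may differ in your Balsam version:

```python
from balsam.api import Job

# Submit a burst of Jobs; the elastic queue at the Site will request
# nodes for them over subsequent service periods. The app_id, workdir,
# and parameters here are illustrative placeholders only.
jobs = [
    Job(
        app_id="Hello",                       # hypothetical registered App
        workdir=f"bursty_demo/run_{n}",       # per-Job working directory
        parameters={"message": f"task {n}"},  # App-specific parameters
        num_nodes=1,
    )
    for n in range(100)
]
Job.objects.bulk_create(jobs)
```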
Auto-scaling is enabled at a Site by setting the `elastic_queue` configuration appropriately in the Site's `settings.yml` file. You should find this line uncommented by default:

```yaml
elastic_queue: null
```
Following this line is a commented block showing an example `elastic_queue` configuration. We want to comment out the `elastic_queue: null` line and uncomment the example configuration, setting each of the parameters appropriately for our use case.
Once the plugin has been configured properly (see below), we must restart the Balsam Site Agent to load it:

```bash
$ balsam site sync

# Or if the Site isn't already running:
$ balsam site start
```
To disable auto-scaling, comment out (or delete) the current `elastic_queue` configuration in `settings.yml` and replace it with the line:

```yaml
elastic_queue: null
```
Then restart the Balsam Site to run without the elastic queue plugin:

```bash
$ balsam site sync
```
The configuration is flexible enough to support a wide range of use cases. This section explains the YAML configuration in chunks.
First, `service_period` controls the waiting period (in seconds) between cycles in which a new `BatchJob` may be submitted to the queue. The `submit_project`, `submit_queue`, and `job_mode` values are passed directly through to the new `BatchJob`.
The `max_queue_wait_time_min` setting determines how long a submitted `BatchJob` should remain enqueued before the elastic queue deletes it and tries re-submitting. When using backfill to grab idle nodes (see the next section), it makes sense to set a relatively short waiting time of 5-10 minutes. Otherwise, this duration should be increased to a reasonable upper threshold to avoid deleting `BatchJobs` that have accrued priority in the queues.
The elastic queue will maintain up to `max_queued_jobs` submissions in the queue at any given time. This should be set to the maximum desired (or allowed) number of simultaneously queued or running `BatchJobs` at the Site.
```yaml
elastic_queue:
  service_period: 60
  submit_project: "datascience"
  submit_queue: "balsam"
  job_mode: "mpi"
  use_backfill: True
  min_wall_time_min: 35
  max_wall_time_min: 360
  wall_time_pad_min: 5
  min_num_nodes: 20
  max_num_nodes: 127
  max_queue_wait_time_min: 10
  max_queued_jobs: 20
```
Many HPC systems use backfill scheduling, which opportunistically places small jobs on nodes that are draining ahead of a larger job scheduled to start at a determined future time. By sizing `BatchJobs` to fit into these idle node-hour windows, Balsam effectively "fills the gaps" in otherwise unused resources. We enable this dynamic sizing with `use_backfill: True`.
The interpretation of `min_wall_time_min` and `max_wall_time_min` depends on whether or not `use_backfill` is enabled (a short sketch of the resulting logic follows the list):

- When `use_backfill` is `False`: `min_wall_time_min` is ignored, and `BatchJobs` are submitted with a constant wall-clock time limit of `max_wall_time_min`.
- When `use_backfill` is `True`: Balsam selects backfill windows that are at least as long as `min_wall_time_min` (this avoids futile 5-minute submissions when all Jobs take at least 30 minutes). The wall-clock time limit is then the lesser of the scheduler's backfill window and `max_wall_time_min`.
- Finally, a "padding" value of `wall_time_pad_min` is subtracted from the final wall-clock time in all `BatchJob` submissions. This should be set to a couple of minutes when `use_backfill` is `True` and `0` otherwise.
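To make the arithmetic concrete, here is a minimal Python sketch of the wall-clock rules described above. It is an illustration of the logic only, not Balsam's actual implementation, and the backfill window values are hypothetical:

```python
from typing import Optional


def select_wall_time(use_backfill: bool,
                     backfill_window_min: int,
                     min_wall_time_min: int,
                     max_wall_time_min: int,
                     wall_time_pad_min: int) -> Optional[int]:
    """Illustrative sketch of the elastic queue's wall-clock rules."""
    if not use_backfill:
        # min_wall_time_min is ignored; always request max_wall_time_min.
        wall_time = max_wall_time_min
    else:
        # Skip backfill windows shorter than min_wall_time_min.
        if backfill_window_min < min_wall_time_min:
            return None  # window too short; nothing is submitted
        # Otherwise take the lesser of the window and max_wall_time_min.
        wall_time = min(backfill_window_min, max_wall_time_min)
    # The padding is subtracted from every submission.
    return wall_time - wall_time_pad_min


# With the example configuration above and a hypothetical 120-minute
# backfill window: min(120, 360) - 5 = 115 minutes requested.
print(select_wall_time(True, 120, 35, 360, 5))   # 115
print(select_wall_time(False, 0, 35, 360, 5))    # 355
```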
```yaml
elastic_queue:
  service_period: 60
  submit_project: "datascience"
  submit_queue: "balsam"
  job_mode: "mpi"
  use_backfill: True
  min_wall_time_min: 35
  max_wall_time_min: 360
  wall_time_pad_min: 5
  min_num_nodes: 20
  max_num_nodes: 127
  max_queue_wait_time_min: 10
  max_queued_jobs: 20
```
Finally, `min_num_nodes` and `max_num_nodes` determine the permissible range of node counts in submitted `BatchJobs`. When `use_backfill` is `True`, backfill windows smaller than `min_num_nodes` are ignored. Otherwise, `BatchJob` submissions use `min_num_nodes` as a lower bound. Likewise, `max_num_nodes` gives an upper bound on each `BatchJob`'s node count. The actual submitted node count falls somewhere in this range: it is determined from the difference between the number of nodes currently requested (by queued or running `BatchJobs`) and the aggregate node footprint of all runnable `Jobs`.
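The node-count selection can be sketched in the same illustrative style (again, a simplification for intuition, not Balsam's actual source):

```python
def select_num_nodes(runnable_job_footprint: int,
                     nodes_already_requested: int,
                     min_num_nodes: int,
                     max_num_nodes: int) -> int:
    """Illustrative sketch: clamp the outstanding node demand to the
    configured (min_num_nodes, max_num_nodes) range."""
    # Outstanding demand: nodes needed by runnable Jobs that are not
    # already covered by queued/running BatchJobs.
    needed = runnable_job_footprint - nodes_already_requested
    if needed <= 0:
        return 0  # nothing to submit this service period
    return max(min_num_nodes, min(needed, max_num_nodes))


# With the example configuration above: 300 runnable node-slots minus
# 127 nodes already requested leaves a demand of 173, which is capped
# at max_num_nodes=127.
print(select_num_nodes(300, 127, 20, 127))  # 127
```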
```yaml
elastic_queue:
  service_period: 60
  submit_project: "datascience"
  submit_queue: "balsam"
  job_mode: "mpi"
  use_backfill: True
  min_wall_time_min: 35
  max_wall_time_min: 360
  wall_time_pad_min: 5
  min_num_nodes: 20
  max_num_nodes: 127
  max_queue_wait_time_min: 10
  max_queued_jobs: 20
```
The elastic queue therefore automatically controls the size and number of requested `BatchJobs` as the workload grows. We can think of each `BatchJob` as a flexibly-sized block of resources: the elastic queue creates multiple blocks (one per `service_period`) while choosing their sizes. If one `BatchJob` cannot accommodate the incoming volume of tasks, then multiple `BatchJobs` of the maximum size are submitted at each iteration.
When the flow of incoming Jobs slows down and the backlog falls inside the (`min_num_nodes`, `max_num_nodes`) range, the `BatchJobs` shrink back down to a single, smaller allocation of resources. As utilization decreases and launchers become idle, nodes are released according to the launcher's `idle_ttl_sec` configuration (also in `settings.yml`).