dra-evolution: counter per request, select device per index for container #22

Closed
pohly opened this issue Jun 5, 2024 · 9 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@pohly
Contributor

pohly commented Jun 5, 2024

This was removed from #14 to limit the scope and because adding it would make the implementation more complex.

For the "give me management access to all devices on a node" use case we need a way to specify "all devices". A counter with minimum 1 and no maximum would be one solution. Management access is currently targeted for 1.31, so perhaps a counter should be, too?

/assign

@johnbelamaric
Contributor

johnbelamaric commented Jun 6, 2024

Summary of options we discussed in Slack:

  • Option 1: No special API for it. Instead, requests are for 1 device, unless they are "management" requests, in which case they are for "all devices on the node matching the requirements". Management requests ignore existing allocations.
    • Pros: simple/no API
    • Cons: Implicit magic can be confusing or surprising
  • Option 2: As in the comment above, add a "count" for devices in a request that can be either an integer or a min/max range. If "max" is not set, this means "all available". Since management requests ignore existing allocations, "all available" is the same as "all devices on the node matching the requirements".
    • Pros: adds functionality with the ability to allocate "all available" for both management requests and ordinary requests.
    • Cons:
      • "all available" being different for management requests and ordinary requests is somewhat implicit magic
      • no clear use case for "all available" in ordinary requests (at least, we haven't heard this one before; "all" on the other hand does have use cases for ordinary requests)
      • it's not clear what the scheduler should do if the range is n-m. Pick min? Max? There is no reason to pick any other value because we don't have aggregate constraints yet ("give me n-m GPUs such that total memory is at least 1Gi")
      • being able to allocate multiple devices in a request means we would likely want to add match attributes at the request level (which we plan to do in the future, but this adds more scope now)
  • Option 3: Add API surface to allow a user to ask for "all" devices on a node, matching the requirements. For ordinary requests, a node can only satisfy this if no device is allocated. For management requests, it is always satisfiable (assuming at least one device matches the requirements).
    • Pros:
      • Compared to options 1 & 2, has an explicit, straightforward API
      • Compared to option 1, increases functionality, because now we have "all" for ordinary requests, too.
      • Compared to option 2, avoids over-expressibility (the range)
      • Compared to option 2, avoids pressure for "match attributes" in the request for this release, since you will get everything (no intra-node topology consideration)
    • Cons:
      • It's not clear anyone would ever want "management && 1 device", so this makes people specify two fields - "management" and "give me all" - in order to get what they actually want.
    • Not enough info to know if it's a pro or con: Compared to option 2, avoids "all available" for ordinary requests (unknown if there are real use cases for that semantic; if so, we can make sure the API can grow to support it)
  • Option 4: Like option 3, but with "management" defaulting to "all: true" and regular requests defaulting to "all: false".
    • Pros: all the same ones of 3
    • Cons: trades the con in 3 for the "implicit magic" con

@pohly @klueska

Personally I still lean toward option 1 - it's less work and I am quite conscious of the ticking clock. Though 3 or 4 would be nice too, since I think people would use that to ensure they get the whole node.

@pohly
Contributor Author

pohly commented Jun 7, 2024

option 2: it's not clear what the scheduler should do if the range is n-m. Pick min? Max? There is no reason to pick any other value because we don't have aggregate constraints yet ("give me n-m GPUs such that total memory is at least 1Gi")

I'd say: as many as possible. Quota would still be considered, so quota and what's still available would be the limiting factors.

One use case that people sometimes asked for was "optional allocation": give me a device if there is one, but proceed without it if not. That's for pods which can take advantage of an accelerator if they can get it, but have fallbacks if not. I'm fine with not covering that yet, but we should keep it in mind.

I like the explicit "all" and I think a fixed amount also would be very useful because then users don't need to replicate identical requests.

So let's brainstorm the API...

requests: # example 1: ask for all matching devices
- requestName: gpus
  amount: # one-of!
    all: true # all: false does not make sense, so in Go this would be `All bool` with the tag `json:"all,omitempty"`
  deviceClass: gpu.dra.example.com

requests: # example 2: ask for a fixed count
- requestName: gpus
  amount: # one-of!
    count: 4
  deviceClass: gpu.dra.example.com

requests: # example 3
- requestName: gpu
  # no amount: defaults to 1
  deviceClass: gpu.dra.example.com

By making this a one-of, we can add a range or optional allocation later without breaking old clients.
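
A minimal Go sketch of how that one-of could be modeled, purely for illustration - the type and field names (Request, Amount, All, Count) follow the YAML above and are assumptions, not the final API:

package api

// Request asks for devices of one class; Amount is the proposed one-of.
type Request struct {
	RequestName string `json:"requestName"`
	DeviceClass string `json:"deviceClass"`

	// Amount is a one-of: at most one member may be set.
	// Leaving it unset defaults to a count of 1.
	Amount *Amount `json:"amount,omitempty"`
}

// Amount holds exactly one of its members.
type Amount struct {
	// All requests all devices on the node that match the requirements.
	// all: false does not make sense, hence the omitempty.
	All bool `json:"all,omitempty"`

	// Count requests a fixed number of devices.
	Count *int64 `json:"count,omitempty"`

	// A range or "optional allocation" member can be added here later
	// without breaking old clients that only know about All and Count.
}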

@johnbelamaric
Contributor

To be clear: "all" here means "all devices on the node meeting the requirements; fail to schedule if you cannot get all of them". The difference for "management" requests remains: they don't allocate, so they basically always succeed.

Later we can add range, optional allocation, and even allAvailable to get "as many as possible", as needed.

lgtm!

@johnbelamaric
Contributor

One thing - does this mean we need matchAttributes within a request? Or can we leave that out and add it in the next release? Users can still apply match attributes across requests (which in fact implies matching within the request... so that's almost as good).

@pohly
Contributor Author

pohly commented Jun 8, 2024

does this mean we need matchAttributes within a request

I had that in some earlier proposal. I think it's okay to leave it out because, as you said, the claim-level constraints apply and should cover most cases.

@pohly
Contributor Author

pohly commented Jun 8, 2024

There is one more complication that we need to warn users about: the size of slices in the API is limited, which limits how many devices can be allocated per claim (via the results slice). Even a single request can exceed that limit if its count is too high. I'll add that to the docs in a PR for this issue, and it might be worth adding some validation for this.
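
A hedged sketch of what such validation could look like; the constant name and value below are placeholders for whatever cap the results slice ends up having, not the real API constant:

package validation

import "fmt"

// maxAllocationResults is a placeholder for the actual cap on the results
// slice of an allocated claim; the real name and value may differ.
const maxAllocationResults = 32

// validateRequestCount rejects a fixed count that could never fit into a
// single claim's results slice, so users get an error at admission time
// instead of a claim that can never be allocated.
func validateRequestCount(count int64) error {
	if count < 1 {
		return fmt.Errorf("count must be at least 1, got %d", count)
	}
	if count > maxAllocationResults {
		return fmt.Errorf("count %d exceeds the maximum of %d allocated devices per claim", count, maxAllocationResults)
	}
	return nil
}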

@pohly
Contributor Author

pohly commented Jun 8, 2024

There's another problem with "all devices on a node": the way we have currently defined ResourcePool, a scheduler cannot be certain that it has seen all ResourcePool objects published by the driver. It might schedule a DaemonSet pod as soon as it has seen one device in one object, granting access to just that one device, while the driver is still publishing 10 more.

I have some ideas how we can define our pool concept better to avoid that problem. I'll put that and the amount API into kubernetes/enhancements#4709 - probably tomorrow.
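
To illustrate the direction, here is one hypothetical shape for such a completeness check: if every published object stated how many objects make up the pool at its current generation, the scheduler could wait until it has seen all of them before computing "all devices". The field names below (PoolName, Generation, TotalSlices) are made up for this sketch and are not the actual proposal in the KEP:

package scheduler

// poolSlice is a hypothetical view of one pool object published by a driver.
type poolSlice struct {
	PoolName    string
	Generation  int64 // bumped whenever the driver republishes the pool
	TotalSlices int   // number of objects the pool consists of at this generation
}

// poolComplete reports whether every object of the pool's newest generation
// has been observed, so that "all devices" can be computed safely.
func poolComplete(slices []poolSlice) bool {
	if len(slices) == 0 {
		return false
	}
	newest := slices[0].Generation
	for _, s := range slices {
		if s.Generation > newest {
			newest = s.Generation
		}
	}
	seen, total := 0, 0
	for _, s := range slices {
		if s.Generation == newest {
			seen++
			total = s.TotalSlices
		}
	}
	return total > 0 && seen == total
}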

@pohly
Contributor Author

pohly commented Jun 9, 2024

I updated kubernetes/enhancements#4709 with the revised pool concept and "amount" per request. Let's discuss directly there; I am not going to do a prototype PR.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 7, 2024
@pohly pohly closed this as completed Sep 10, 2024