dra-evolution: counter per request, select device per index for container #22

Closed
pohly opened this issue Jun 5, 2024 · 9 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@pohly
Contributor

pohly commented Jun 5, 2024

This was removed from #14 to limit the scope and because adding it would make the implementation more complex.

For the "give me management access to all devices on a node" use case we need a way to specify "all devices". A counter with minimum 1 and no maximum would be one solution. Management access is currently targeted for 1.31, so perhaps a counter should be, too?

/assign

@johnbelamaric
Contributor

johnbelamaric commented Jun 6, 2024

Summary of options we discussed in Slack:

  • Option 1: No special API for it. Instead, requests are for 1 device, unless they are "management" requests, in which case they are for "all devices on the node matching the requirements". Management requests ignore existing allocations.
    • Pros: simple/no API
    • Cons: Implicit magic can be confusing or surprising
  • Option 2: As in the comment above, add a "count" for devices in a request that can be either an integer or a min/max range. If "max" is not set, this means "all available". Since management requests ignore existing allocations, "all available" is the same as "all devices on the node matching the requirements".
    • Pros: adds functionality with the ability to allocate "all available" for both management requests and ordinary requests.
    • Cons:
      • "all available" being different for management requests and ordinary requests is somewhat implicit magic
      • no clear use case for "all available" in ordinary requests (at least, we haven't heard this one before; "all" on the other hand does have use cases for ordinary requests)
      • it's not clear what the scheduler should do if the range is n-m. Pick min? Max? There is no reason to pick any other value because we don't have aggregate constraints yet ("give me n-m GPUs such that total memory is at least 1Gi")
      • being able to allocate multiple devices in a request means we would likely want to add match attributes at the request level (which we plan to do in the future, but this adds more scope now)
  • Option 3: Add API surface to allow a user to ask for "all" devices on a node, matching the requirements. For ordinary requests, a node can only satisfy this if no device is allocated. For management requests, it is always satisfiable (assuming at least one device matches the requirements).
    • Pros:
      • Compared to options 1 & 2, has an explicit, straightforward API
      • Compared to option 1, increases functionality, because now we have "all" for ordinary requests, too.
      • Compared to option 2, avoids over-expressibility (the range)
      • Compared to option 2, avoids pressure for "match attributes" in the request for this release, since you will get everything (no intra-node topology consideration)
    • Cons:
      • It's not clear anyone would ever want "management && 1 device", so this makes people specify two fields - "management" and "give me all" - in order to get what they actually want.
    • Not enough info to know if it's a pro or con: Compared to option 2, avoids "all available" for ordinary requests (unknown if there are real use cases for that semantic; if so, we can make sure the API can grow to support it)
  • Option 4: Like option 3, but with "management" defaulting to "all: true" and regular requests defaulting to "all: false".
    • Pros: all the same ones of 3
    • Cons: trades the con in 3 for the "implicit magic" con

@pohly @klueska

Personally I still lean toward option 1 - it's less work and I am quite conscious of the ticking clock. Though 3 or 4 would be nice too, since I think people would use that to ensure they get the whole node.

@pohly
Contributor Author

pohly commented Jun 7, 2024

option 2: it's not clear what the scheduler should do if the range is n-m. Pick min? Max? There is no reason to pick any other value because we don't have aggregate constraints yet ("give me n-m GPUs such that total memory is at least 1Gi")

I'd say: as many as possible. Quota would still be considered, so quota and what's still available would be the limiting factors.

One use case that people sometimes asked for was "optional allocation": give me a device if there is one, but proceed without it if not. That's for pods which can take advantage of an accelerator if they can get it, but have fallbacks if not. I'm fine with not covering that yet, but we should keep it in mind.

I like the explicit "all" and I think a fixed amount also would be very useful because then users don't need to replicate identical requests.

So let's brainstorm the API...

requests: # example 1: ask for all matching devices
- requestName: gpus
  amount: # one-of!
    all: true # all: false does not make sense, so in Go this would be `All bool` with the tag `json:"all,omitempty"`
  deviceClass: gpu.dra.example.com

requests: # example 2: ask for a fixed count
- requestName: gpus
  amount: # one-of!
    count: 4
  deviceClass: gpu.dra.example.com

requests: # example 3
- requestName: gpu
  # no amount: defaults to 1
  deviceClass: gpu.dra.example.com

By making this a one-of, we can add a range or optional allocation later without breaking old clients.
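
A minimal Go sketch of how that one-of could be modeled, purely for illustration - the type and field names (Request, Amount, All, Count) follow the YAML above and are assumptions, not the final API:

package api

// Request asks for devices of one class; Amount is the proposed one-of.
type Request struct {
	RequestName string `json:"requestName"`
	DeviceClass string `json:"deviceClass"`

	// Amount is a one-of: at most one member may be set.
	// Leaving it unset defaults to a count of 1.
	Amount *Amount `json:"amount,omitempty"`
}

// Amount holds exactly one of its members.
type Amount struct {
	// All requests all devices on the node that match the requirements.
	// all: false does not make sense, hence the omitempty.
	All bool `json:"all,omitempty"`

	// Count requests a fixed number of devices.
	Count *int64 `json:"count,omitempty"`

	// A range or "optional allocation" member can be added here later
	// without breaking old clients that only know about All and Count.
}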

@johnbelamaric
Contributor

To be clear: "all" here means "all devices on the node meeting the requirements; fail to schedule if you cannot get all of them". The difference for "management" requests remains: they don't allocate, so they basically always succeed.

Later we can add range, optional allocation, and even allAvailable to get "as many as possible", as needed.

lgtm!

@johnbelamaric
Contributor

One thing - does this mean we need matchAttributes within a request? Or can we leave that out and add it in the next release? Users can still apply match attributes across requests (which in fact implies matching within the request... so that's almost as good).

@pohly
Contributor Author

pohly commented Jun 8, 2024

does this mean we need matchAttributes within a request

I had that in some earlier proposal. I think it's okay to leave it out because, as you said, the claim-level constraints apply and should cover most cases.

@pohly
Contributor Author

pohly commented Jun 8, 2024

There is one more complication that we need to warn users about: the size of slices in the API is limited, which limits how many devices can be allocated per claim (via the results slice). Even a single request can exceed that limit if its count is too high. I'll add that to the docs in a PR for this issue, and it might be worth adding some validation for this.
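
A hedged sketch of what such validation could look like; the constant name and value below are placeholders for whatever cap the results slice ends up having, not the real API constant:

package validation

import "fmt"

// maxAllocationResults is a placeholder for the actual cap on the results
// slice of an allocated claim; the real name and value may differ.
const maxAllocationResults = 32

// validateRequestCount rejects a fixed count that could never fit into a
// single claim's results slice, so users get an error at admission time
// instead of a claim that can never be allocated.
func validateRequestCount(count int64) error {
	if count < 1 {
		return fmt.Errorf("count must be at least 1, got %d", count)
	}
	if count > maxAllocationResults {
		return fmt.Errorf("count %d exceeds the maximum of %d allocated devices per claim", count, maxAllocationResults)
	}
	return nil
}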

@pohly
Contributor Author

pohly commented Jun 8, 2024

There's another problem with "all devices on a node": the way we have currently defined ResourcePool, a scheduler cannot be certain that it has seen all ResourcePool objects published by the driver. It might schedule a DaemonSet pod as soon as it has seen one device in one object, granting access to just that one device, while the driver is still publishing 10 more.

I have some ideas how we can define our pool concept better to avoid that problem. I'll put that and the amount API into kubernetes/enhancements#4709 - probably tomorrow.
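
To illustrate the direction, here is one hypothetical shape for such a completeness check: if every published object stated how many objects make up the pool at its current generation, the scheduler could wait until it has seen all of them before computing "all devices". The field names below (PoolName, Generation, TotalSlices) are made up for this sketch and are not the actual proposal in the KEP:

package scheduler

// poolSlice is a hypothetical view of one pool object published by a driver.
type poolSlice struct {
	PoolName    string
	Generation  int64 // bumped whenever the driver republishes the pool
	TotalSlices int   // number of objects the pool consists of at this generation
}

// poolComplete reports whether every object of the pool's newest generation
// has been observed, so that "all devices" can be computed safely.
func poolComplete(slices []poolSlice) bool {
	if len(slices) == 0 {
		return false
	}
	newest := slices[0].Generation
	for _, s := range slices {
		if s.Generation > newest {
			newest = s.Generation
		}
	}
	seen, total := 0, 0
	for _, s := range slices {
		if s.Generation == newest {
			seen++
			total = s.TotalSlices
		}
	}
	return total > 0 && seen == total
}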

@pohly
Contributor Author

pohly commented Jun 9, 2024

I updated kubernetes/enhancements#4709 with the revised pool concept and "amount" per request. Let's discuss directly there; I am not going to do a prototype PR.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 7, 2024
@pohly pohly closed this as completed Sep 10, 2024