dra-evolution: restore minimal support for "classic DRA" #26

Closed
wants to merge 4 commits

Conversation

pohly
Contributor

@pohly pohly commented Jun 6, 2024

"Classic DRA" has a control plane controller which handles allocation and deallocation of a claim in cooperation with the scheduler. That is the part which gets added back to the API.

What doesn't get restored compared to v1.30.0 is:

  • immediate allocation
  • using a CRD as user-facing API
  • the opaque "resource handle" in the allocation result

Drivers which want to support this mode of operation have to use the in-tree types. The "config" fields support that, albeit perhaps with a slightly less natural API.

While at it, some occurrences of "resource" get replaced with "device" or "DRA".

/assign @thockin @klueska

"Classic DRA" has a control plane controller which handles allocation and
deallocation of a claim in cooperation with the scheduler. That is the part
which gets added back to the API.

What doesn't get restored compared to v1.30.0 is:
- immediate allocation
- using a CRD as user-facing API
- the opaque "resource handle" in the allocation result

Drivers which want to support this mode of operation have to use the in-tree
types. The "config" fields support that, albeit perhaps with a slightly less
natural API.

While at it, some occurrences of "resource" get replaced with "device" or
"DRA".
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 6, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 6, 2024
Data *StructuredDriverData `json:"data,omitempty" protobuf:"bytes,2,opt,name=data"`

// Alternative to "Data", not describe further here for the sake of brevity:
// OpaqueData *string // Set by control plane controller when using "classic DRA".
Contributor Author

I came to the conclusion that we don't need this opaque data and thus StructuredDriverData can be inlined.

A control plane controller can and should use instead the in-tree RequestAllocationResult. Then selecting a specific request for a container works the same way in both allocation modes.
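To make this concrete, here is a hedged, self-contained sketch in Go with stand-in types (the field names Request, DriverName and DeviceName are assumptions for illustration, not the prototype's exact API) showing why a per-request result list keeps container-level selection identical in both allocation modes:

// Stand-in for the in-tree per-request allocation result; a control plane
// controller would populate entries like these instead of an opaque blob.
type requestAllocationResult struct {
	Request    string // name of the request in the claim spec
	DriverName string // driver that allocated the device
	DeviceName string // device chosen for this request
}

// devicesForRequest works the same regardless of whether the results were
// written by the scheduler (structured parameters) or by a control plane
// controller ("classic DRA").
func devicesForRequest(results []requestAllocationResult, request string) []string {
	var names []string
	for _, r := range results {
		if r.Request == request {
			names = append(names, r.DeviceName)
		}
	}
	return names
}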

Contributor

So if I'm reading this diff correctly, the only fields in this struct now are DriverName and Config? Is this because the new Config section is (under the hood) just a RawExtension, so that can now be used to encapsulate any opaque data that needs to be passed through?

Contributor Author

Yes, that was the idea.

This moves the "results" list up in the AllocationResult, which matches the
"requests" in the spec. The claim.spec.onfig directly corresponds to
allocationResult.config.
Comment on lines 404 to 406
// This node name together with the driver name and
// the device name field identify which device was allocated.
NodeName string `json:"nodeName"`
Contributor Author

Let's generalize this a bit more so that it also works for network-attached devices, whether they get allocated through a control plane controller (this would be possible in 1.31) or with some future extension of structured parameters.

Suggested change
// This node name together with the driver name and
// the device name field identify which device was allocated.
NodeName string `json:"nodeName"`
// This name together with the driver name and
// the device name field identify which device was allocated.
//
// For a driver with node-local devices, it is the name of the node
// and the overall ID is `<driver name>/<segment name = node name>/<device name>`.
//
// For a driver where device names are unique in a certain subset of
// the cluster (region, rack, etc.), the name identifies that segment.
//
// For a driver where device names are unique in the cluster, this
// name may be empty.
//
// +optional
SegmentName string `json:"segmentName,omitempty"`

Contributor Author

I'm torn between allowing just one intermediate segment and many. It could make sense to allow <driver name>/<region>/<data center>/<device name> as the string ID for a device.

We want to be sure that there is a driver name and device name, but what the driver puts into the middle doesn't really matter.

Naming, oh why do you have to be so difficult?!

Using "PoolName" is tempting ("names the pool in which the device name must be unique"), but it clashes with ResourcePool.Name.

DevicePoolName? Max length 128 characters, DNS sub-domains separated by slashes. Optional.

That goes well together with DeviceName.

Once we get to structured parameters for network-attached devices, a ResourcePool object for those would have:

  • a `NodeSelector *v1.NodeSelector` instead of `NodeName string`.
  • such a DevicePoolName that serves as part of the unique ID
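As a side note, a minimal sketch (not part of the proposal, and using only the standard strings package) of how such a string ID could be composed and split, assuming the driver name and device name never contain slashes while the middle part may:

// deviceID composes the string form <driver name>/<device pool name>/<device name>;
// the pool part is optional and may itself contain slashes (e.g. region/data-center).
func deviceID(driverName, devicePoolName, deviceName string) string {
	if devicePoolName == "" {
		return driverName + "/" + deviceName
	}
	return driverName + "/" + devicePoolName + "/" + deviceName
}

// splitDeviceID reverses deviceID. It assumes the ID has at least the form
// "<driver name>/<device name>".
func splitDeviceID(id string) (driverName, devicePoolName, deviceName string) {
	parts := strings.Split(id, "/")
	driverName = parts[0]
	deviceName = parts[len(parts)-1]
	if len(parts) > 2 {
		devicePoolName = strings.Join(parts[1:len(parts)-1], "/")
	}
	return
}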

Contributor Author

We have to allow a node name, so the max length has to be 253.

Contributor Author

I've pushed this as a third commit.

Contributor Author

And here's how this would work together with a ResourcePool (but note my question about that name):

// ResourcePool represents a collection of devices managed by a given driver. How
// devices are divided into pools is driver-specific, but typically the
// expectation would be a pool per identical collection of devices, per node.
// It is fine to have more than one pool for a given node, for the same driver.
//
// Where a device gets published may change over time. The unique identifier
// for a node-local device is the tuple `<driver name>/<node name>/<device name>`. Each
// of these names is a DNS label or domain, so it is okay to concatenate them
// like this in a string with a slash as separator.
//
// For non-local devices, the driver can either make all device names globally
// unique (`<driver name>/<device name>`) or provide a device pool name
// (`<driver name>/<device pool name>/<device name>`).

type ResourcePoolSpec struct {
    // NodeName identifies the node which provides the devices. All devices
    // are local to that node.
    //
    // Node name and device pool are mutually exclusive. At least one must be set.
    NodeName string `json:"nodeName,omitempty"`

    // DevicePool can be used for non-node-local devices instead of a NodeName
    // to define where the devices are available.
    //
    // Node name and device pool are mutually exclusive. At least one must be set.
    DevicePool *DevicePool `json:"devicePool,omitempty"`

    // POTENTIAL FUTURE EXTENSION: NodeSelector *v1.NodeSelector

    // DriverName identifies the DRA driver providing the capacity information.
    // A field selector can be used to list only ResourcePool
    // objects with a certain driver name.
    //
    // Must be a DNS subdomain and should end with a DNS domain owned by the
    // vendor of the driver.
    DriverName string `json:"driverName" protobuf:"bytes,3,name=driverName"`

    // Devices lists all available devices in this pool.
    //
    // Must not have more than 128 entries.
    Devices []Device `json:"devices,omitempty"`

    // FUTURE EXTENSION: some other kind of list, should we ever need it.
    // Old clients seeing an empty Devices field can safely ignore the (to
    // them) empty pool.
}

const ResourcePoolMaxDevices = 128

// DevicePool is a more general description for a set of devices.
// A single node is a special case of this.
type DevicePool struct {
    // This name together with the driver name and the device name
    // identify a device.
    //
    // Must not be longer than 253 characters and may contain one or more
    // DNS sub-domains separated by slashes.
    //
    // +optional
    Name string `json:"name,omitempty"`

    // This identifies nodes where the devices may be used. If unset,
    // all nodes have access.
    //
    // +optional
    NodeSelector *v1.NodeSelector `json:"nodeSelector,omitempty"`
}
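For clarity, a minimal sketch of the validation implied by the "mutually exclusive, at least one must be set" comments above, assuming the ResourcePoolSpec type from this snippet and the standard errors package (not the actual validation code):

func validatePoolLocation(spec *ResourcePoolSpec) error {
	switch {
	case spec.NodeName != "" && spec.DevicePool != nil:
		return errors.New("nodeName and devicePool are mutually exclusive")
	case spec.NodeName == "" && spec.DevicePool == nil:
		return errors.New("either nodeName or devicePool must be set")
	}
	return nil
}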

Contributor

I find it super confusing to have both ResourcePool and DevicePool. I think we can find a better name.

Based on the discussion in the other comment, the purpose of this is to allow the controller to identify the "owner" of the device. In the case of node-local, that is always the node so we didn't need anything but node name. With networked devices, this is controller-specific and effectively an opaque (to K8s) identifier. I can see where you are going with "device pool" in that sense, but I think it's too confusing. So, let's brainstorm some other ideas. What about something of this form: {"",Device,Resource} + {Manager,Owner,Driver} + {Id,Identifier,Key,Token,OpaqueId}. So, like, DriverOpaqueId would be one possible name based on that. Other ideas?

Contributor

Actually I really like DriverOpaqueId. It makes it clear it is 1) driver-specific; 2) opaque to k8s (not a resource name for example); 3) an identifier

Contributor Author
@pohly pohly Jun 8, 2024

I agree, the naming of ResourcePool needs further thought. The current name implies "this is one pool" but it's really just a subset of some other, larger pool, the one which contains all devices provided by the same driver name and published with the same "DriverOpaqueID" (or whatever we call it).

I would stay with the ResourceSlice type and define it like this:

Devices are identified by the name of the driver, a pool name, and a unique name inside that pool. A driver publishes information about its resources in one or more ResourceSlice objects. Each slice is part of one pool and contains the common attributes of a pool (the pool name, which nodes have access to the devices in the pool) and some of the devices in the pool. The common attributes are provided in each slice to enable correlating them (via the driver name and pool name) and to provide all relevant information without having to look up other objects.

Another common attribute is the total number of devices in the pool. This information enables consumers to determine whether they have seen all slices of a pool.

All slices belonging to the same pool must share the same generation number (TODO: can we use ObjectMeta.Generation for this or do we have to define our own field?). When a consumer sees slices of different generations, it should ignore all old slices. This can happen while the driver is updating the pool.

Pools that are local to a node are special. They get published with a node name, which then also serves as pool name. There are special permission checks that limit access from a node to slices with the same node name. Provisioning a new node of the same type is likely to provide the same set of devices as some old node.

With this definition, we can safely implement "give me access to all devices" because the scheduler knows when it lacks information. The "pool name" is meaningful and won't get confused with the name of a ResourcePool.

This raises the question of whether we should have a separate "pool" type. I don't think we need it; some small amount of duplicated information IMHO is fine.
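A hedged sketch of the consumer-side rule described above ("ignore all old slices"), using a hypothetical, minimal slice shape; the field names are assumptions for illustration only:

// Hypothetical, minimal shape of a published slice.
type slice struct {
	DriverName string
	PoolName   string
	Generation int64
	Devices    []string
}

// keepLatest drops slices that belong to an older generation of their pool,
// which is what a consumer would do while a driver is updating the pool.
func keepLatest(slices []slice) []slice {
	latest := map[string]int64{} // key: "<driver name>/<pool name>"
	for _, s := range slices {
		key := s.DriverName + "/" + s.PoolName
		if s.Generation > latest[key] {
			latest[key] = s.Generation
		}
	}
	var out []slice
	for _, s := range slices {
		if s.Generation == latest[s.DriverName+"/"+s.PoolName] {
			out = append(out, s)
		}
	}
	return out
}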

Contributor

Ahh...yes, this makes sense to me. What if we took this a little further, and we do have two types: DevicePool and ResourceSlice. Instead of "Pools that are local to a node are special. They get published with a node name, which then also serves as pool name.", we could have the idea of a "DriverInstance" which could be the node name for node-local devices, but could vary with other types of devices. So, it would be the four values: (DriverName, DriverInstance, DevicePoolName, DeviceName) that identifies an individual device. As you say, ResourceSlice would be the mechanism by which we publish the elements of the DevicePool. We could publish a DevicePool resource as well, and it would include a total device count and/or a slice count, and a generation ID as you say, so that we have this one object that we can use to ensure consistency across all the slices for that pool. We could also normalize common attributes up to the pool, as we have discussed before; I don't think we would need to normalize them up to the slice, just to the pool.

So - I think this is a good model. What I really like is that it weakens the "node" boundary in a way I think will be important in the long run (it's even in our charter...). What I don't like is that it requires a lot of explanation. I think if we omit the DevicePool object itself, it will be harder for people to understand. So I would want to have that in there. I think it also helps with revision control - it lets us know what the current generation is. Otherwise, I imagine some race conditions when we just say it is "the latest one we have seen".
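For illustration, a hedged sketch of what the pool-level object in this two-type model might carry; the type and field names are assumptions, not proposal text:

// Hypothetical pool-level object; ResourceSlice objects would publish the
// actual devices and point back to it via driver name, instance, and pool name.
type DevicePoolSpec struct {
	DriverName     string // e.g. "dra.example.com"
	DriverInstance string // node name for node-local devices, driver-defined otherwise
	Generation     int64  // lets consumers detect and discard stale slices
	DeviceCount    int    // total devices across all slices of this generation
	SliceCount     int    // consumers know when they have seen every slice
}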

Contributor Author

I've added these ideas with a different wording to kubernetes/enhancements#4709.

Let's discuss there.

@johnbelamaric johnbelamaric mentioned this pull request Jun 6, 2024
Calling this "NodeName" didn't quite capture the purpose. It was correct for
node-local devices, but only for those.
- deviceClassName: foozer.example.com
- deviceClassName: sriov-nic-example.org
- name: gpu
deviceClassName: foozer.example.com
Contributor Author

We decided to make request names mandatory. Not all examples have been updated, so I am doing it here now because I am touching it.

pohly added a commit to pohly/enhancements that referenced this pull request Jun 7, 2024
This is the result of the prototyping in
https://github.com/kubernetes-sigs/wg-device-management/tree/main/dra-evolution.

The key changes compared to 1.30 are:
- Removal of CRD support. Vendor configuration parameters
  are still supported as raw extensions of the in-tree types.
- Removal of the notion of a "device model" and a selection mechanism
  that only works for one such model. This simplifies the API.
- Instead of one claim referencing a ResourceClass, individual
  requests inside a claim now reference a DeviceClass.
- Support for "match attributes".
- Support for "management access".
- Support for selecting devices from a request in a claim for a
  container instead of exposing all devices of a claim.

Editorial changes:
- Reduced usage of the word "resource" in favor of "devices" and
  "DRA drivers".
- Removal of some out-dated text.

This revision of the KEP roughly corresponds to
https://github.com/kubernetes-sigs/wg-device-management/tree/258d109f54d3baaa48719519dec37a450a1dd5a1/dra-evolution
(= kubernetes-sigs/wg-device-management#14 with some
minor edits).

What is missing:
- renaming of ResourceSlice ->
  ResourcePool (https://github.com/kubernetes-sigs/wg-device-management/pull/28/files#r1630964004)

What was added from kubernetes-sigs/wg-device-management#26:
- ResourceClaim.Status.Allocation.NodeName -> NodeSelector
- flattening of the allocation result

Undecided:
- kubernetes-sigs/wg-device-management#26
- kubernetes-sigs/wg-device-management#16
- kubernetes-sigs/wg-device-management#18
Comment on lines +382 to +388
// DriverName specifies the name of the DRA driver whose kubelet
// plugin should be invoked to process the allocation once the claim is
// needed on a node.
//
// Must be a DNS subdomain and should end with a DNS domain owned by the
// vendor of the driver.
DriverName string `json:"driverName" protobuf:"bytes,1,name=driverName"`
Contributor

Why do we need this both here and in DriverData? Speaking of which, I don't see DriverData embedded in any of the other types here. Remind me where it gets referenced again?

Contributor Author

DriverName became obsolete because I moved all of its fields directly into AllocationResult. I've removed it now.

// DNS sub-domains separated by slashes.
//
// +optional
DevicePoolName string `json:"devicePoolName,omitempty"`
Contributor

When I see this I think of our whole ResourcePool and DevicePool discussion and it makes me think this is referring to the name of the ResourcePool that the device in this result was allocated from. I think it is probably meant to be something different though, but the previous discussions are carrying too much baggage for me to understand exactly what that is.


Comment on lines +436 to +443
// DriverConfigurationParameters must have one and only one field set.
//
// In contrast to ConfigurationParameters, the driver name is
// not included and has to be inferred from the context.
type DriverConfigurationParameters struct {
Opaque *runtime.RawExtension `json:"opaque,omitempty" protobuf:"bytes,1,opt,name=opaque"`
}

Contributor

I don't quite understand this comment. The ConfigurationParameters and DriverConfigurationParameters structs look identical to me.

Contributor

Edit: I see the difference now -- the other struct has an embedded OpaqueConfigurationParameters with an embedded DriverName field, rather than a direct field for runtime.RawExtension.

It's still confusing to look at, but I at least see the difference now...
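Sketched side by side (a paraphrase of the difference as described in this thread, not a copy of the proposal), the two shapes look roughly like this, with runtime.RawExtension coming from k8s.io/apimachinery:

type OpaqueConfigurationParameters struct {
	DriverName string                // names the driver these parameters belong to
	Parameters *runtime.RawExtension // arbitrary, driver-defined payload
}

// ConfigurationParameters carries the driver name explicitly.
type ConfigurationParameters struct {
	Opaque *OpaqueConfigurationParameters
}

// DriverConfigurationParameters leaves the driver implicit in its context.
type DriverConfigurationParameters struct {
	Opaque *runtime.RawExtension
}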

Contributor Author

Would it make more sense to repeat the driver name, i.e. use ConfigurationParameters?

That could look odd in YAML because the driver name will be more obviously redundant:

results:
- driverName: dra.example.com
  config:
  - opaque:
      driverName: dra.example.com
      parameters:
        ...

Contributor Author

Perhaps it actually makes sense? One driver could understand the parameters of another.

We don't do that with structured parameters (it's defined as "only configs for the driver are passed down to it"), but perhaps that could change and a control plane controller could already do it now.

Contributor Author

Not added yet to kubernetes/enhancements#4709 but perhaps we should?

If we do, we probably should make it consistent and also pass all configurations down when using structured parameters.

pohly added a commit to pohly/enhancements that referenced this pull request Jun 11, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 7, 2024
@pohly pohly closed this Sep 10, 2024