-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor checkpointing #51
base: dra-1.31
Are you sure you want to change the base?
Conversation
b69d976
to
1e9402b
Compare
13c74c9
to
85f0965
Compare
1e9402b
to
8693ebc
Compare
818c6c1
to
a011083
Compare
8693ebc
to
ce39107
Compare
ce39107
to
5452056
Compare
fb59476
to
68ad8b2
Compare
// plugin should be invoked to process this ResourceHandle's data once it | ||
// lands on a node. This may differ from the DriverName set in | ||
// ResourceClaimStatus this ResourceHandle is embedded in. | ||
DriverName string `json:"driverName,omitempty" protobuf:"bytes,1,opt,name=driverName"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need protobuf, do we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the json
tag also can be dropped. It should be able to decode 1.30 files without it (JSON isn't case-sensitive, if I'm not mistaken).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
) | ||
|
||
// V1 data structures are copies of the correspondent structures from state_checkpoint.go (k/k 1.30) | ||
// This is done to avoid breaking changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the goal is to define two versions as a proof-of-concept, right?
Some commit message explaining why you are making these changes would help with a review.
Also, this isn't complete yet, is it? I was looking for the code that switches between the two versions depending on what is in a checkpoint file, but haven't found it.
cc @klueska
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually an older comment. Might be obsolete now, I haven't checked again yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, obsolete as the rest of the comments due to the latest update.
@@ -85,9 +85,6 @@ func (manager *impl) GetCheckpoint(checkpointKey string, checkpoint Checkpoint) | |||
return err | |||
} | |||
err = checkpoint.UnmarshalCheckpoint(blob) | |||
if err == nil { | |||
err = checkpoint.VerifyChecksum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bart0sh: can you add a commit message explaining the "what" and "why" of this commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as we moved to another approach this is not needed anymore. checksum is always verified. see details in the types.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bart0sh I'm not quite sure what you mean? This is in the generic library code, and Patrick made changes to the devicemanager and the cpumanager to adhere to this API. I don't think we can just remove this because of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion. This is an old code and it's not in this PR anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you forgot to remove it? I am commenting on this from the "Files Changed" tab of the PR...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my bad. removed. Thank you for pointing out.
5452056
to
70a9844
Compare
43567fc
to
d11b58e
Compare
6bdb873
to
19b170b
Compare
19b170b
to
c94f83f
Compare
Unlike kubelet/apis/config, this is an internal-only API, so it doesn't need internal and external parts. When we create v2, the conversion-gen machinery should be able to generate v1->v2 conversion code. @pohly Is this a correct assumption? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this file, even if we don't ever plan to expose this over the API server? Seems silly to have a whole file just for this one constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't. And we don't need doc.go either it seems. I'm going to remove both and also move APIs(New, MarshalCheckpoint, etc) out of this directory to avoid duplicating them for every API version.
"k8s.io/kubernetes/pkg/kubelet/checkpointmanager" | ||
"k8s.io/kubernetes/pkg/kubelet/checkpointmanager/errors" | ||
checkpointapi "k8s.io/kubernetes/pkg/kubelet/cm/dra/checkpoint/v1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we not just alias this to checkpoint
? Calling checkpointapi.New()
is more awkward than simply calling checkpoint.New()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do that, but we may need to import checkpoint APIs from the k8s.io/kubernetes/pkg/kubelet/cm/dra/checkpoint
. How would you suggest to call this import?
pc := &Checkpoint{ | ||
TypeMeta: metav1.TypeMeta{ | ||
Kind: "DRACheckpoint", | ||
APIVersion: "checkpoint.dra.kubelet.k8s.io/v1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use the GroupName
variable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean moving GroupName
constant from register.go here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added GroupName
constant to the types.go and used it in the api.go
pkg/kubelet/cm/dra/claiminfo.go
Outdated
checkpoint "k8s.io/kubernetes/pkg/kubelet/cm/dra/checkpoint" | ||
checkpointapi "k8s.io/kubernetes/pkg/kubelet/cm/dra/checkpoint/v1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't expect us to have to pull both of these in here. Why is ClaimInfoStateList
defined in checkpoint
but ClaimInfoState
is defined in checkpoint/v1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClaimInfoStateList is here to avoid duplication in every API version. It's just a helper type for the list of ClaimInfoStates. If you think it shouldn't be outside of checkpoint/ I can either move it there or get rid of it and simply use []checkpoint.ClaimInfoState instead of it. Which way would you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't expect us to have to pull both of these in here.
that's a result of getting rid of state
subdirectory, I guess. Should I return it back and put state_checkpoint* files there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved state_checkpoint.go and its tests to the pkg/kubelet/cm/dra/state/
. Does it look better now?
I would think that the code in |
I am not sure. The code generator should be able to do it, but I am less sure about how to configure the scheme to use such a direct conversion path. I know for "normal" APIs, the conversion path is always "version A" <-> "internal" <-> "version B". That means conversion back and forth is only needed between each version and the internal representation. I've not seen it being used with just conversion from "version A" -> "version B". No matter whether this works or not, we can start with just encoding/decoding "version A", without any conversion. Then if, and only if we need conversion, we can add it, either with such an internal representation or directly. |
That was my intention. I didn't do any conversion as we have only one version. I can try to add a new verson locally and see if conversion-gen is able to create v1->v2 and v2->v1 conversion, but in my understanding it should. That's why I think that we don't need "internal" part of the API, it would be just a duplication of the latest API version. |
6e92d20
to
02a5ad3
Compare
@pohly btw, do we really ever need to convert from v2 to v1? to do that we'd need to backport v2 types and conversion code to the previous kubelet version as far as I understand. Please, clarify. |
02a5ad3
to
18b7cde
Compare
I think we need to support downgrades. It's not okay if kubelet v1.32 encounters a checkpoint file written by kubelet v1.33 and then refuses to start. That's the opposite direction of what we encountered before, but it's also a problem. To answer your question: writing a checkpoint v2 in kubelet v1.33 and then failing to handle it in kubelet v1.32.0 (no backporting!) has that problem, so it that was the plan, then we need to reconsider. |
9cab240
to
649411e
Compare
@pohly Is it ok to use versioned checkpoint structures (e.g ClaimInfoStateList) in the manager.go and claiminfo.go or we should avoid it? If yes, how to better do it? Should we use "internal" API and conversions for this? |
74af2e5
to
38ed8c2
Compare
// Device is how a DRA driver described an allocated device in a claim | ||
// to kubelet. RequestName and CDI device IDs are optional. | ||
// +k8s:deepcopy-gen=true | ||
type Device struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For symmetry...
type Device struct { | |
type DeviceState struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I also rename DriverState.Devices
-> DriverState.DeviceStates
?
Store(state.ClaimInfoStateList) error | ||
} | ||
|
||
type checkpointer struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be called checkpointerV1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it should always work with the latest API version.
Co-authored-by: Kevin Klues <[email protected]>
38ed8c2
to
01e0c79
Compare
@klueska @pohly let's continue in the k/k upstream PR: kubernetes#126303 |
What type of PR is this?
/kind bug
/kind cleanup
What this PR does / why we need it:
Current implementation doesn't guarantee that checkpointing would work after changing DRA data structures, which was explained in the @123552
This PR changes checkpointing implementation by isolating DRA checkpointing into a separate API maintained as other Kubernetes APIs.
Which issue(s) this PR fixes:
Hopefully fixes kubernetes#123552
Special notes for your reviewer:
This is just a first step, a base for further work. The next step would be checkpointing data optimisation, i.e. only the data that's required should be checkpointed, which is currently not the case.