K8s control plane high-availability mode #940
I think we need to introduce versioning of vHive in the Invitro repo so that we won't need this note, because it is not only about this PR but also about the ongoing Knative and K8s upgrades in vHive.
Pretty good. I left some questions below about the changes; most are just to confirm that it works as expected (I didn't get deep into the HA setting documentation). The most important part is to document these settings in the quickstart or to place them in a separate .md file linked from the quickstart. Also, add a line about it to the changelog; it seems important enough.
@@ -71,43 +73,22 @@ func CreateMultinodeCluster(stockContainerd string) error {
	return nil
}

// Create kubelet service on master node
func CreateMasterKubeletService() error {
Don't we need it? Does it automatically attach to unix:///run/containerd/containerd.sock? That's the most important part here. But also, we are losing the timeout and verbosity parameters. Do we need them?
Because kubeadm starts the kubelet process automatically, there is no need to start kubelet manually. Regarding the containerd socket, it is now provided as an argument to the kubeadm init command.
@leokondrashov: What is the socket path for Firecracker? Should we make this configurable?
This function doesn't create the kubelet, but rather sets it up with parameters. The socket on the master node doesn't need to be configurable since we use Firecracker only for user containers. But the verbosity and runtime-request-timeout set inside might be used for something.
Hmm. You are right about not starting the kubelet process.
For the runtime-request-timeout parameter, it seems to me we can safely remove it (see DOCS). This should mean that each request from the control plane to the kubelet has to complete within 2 minutes (the default) rather than 15 minutes.
The container-runtime flag has been deprecated since v1.24, according to this.
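If the verbosity and request-timeout settings turn out to be needed after all, one option is to keep them without starting kubelet by hand, by dropping them into KUBELET_EXTRA_ARGS, which kubeadm's systemd drop-in picks up when it starts the kubelet. A minimal sketch in the style of these scripts; utils.ExecShellCmd is the helper used in the surrounding diff, and the concrete values are assumptions, not what this PR does:

// Sketch only (not part of this PR): preserve the old kubelet flags without
// starting the kubelet manually. kubeadm's systemd drop-in sources /etc/default/kubelet,
// so flags placed in KUBELET_EXTRA_ARGS are applied when kubeadm starts the kubelet.
func setKubeletExtraArgs(verbosity int, requestTimeout string) error {
	_, err := utils.ExecShellCmd(
		"echo 'KUBELET_EXTRA_ARGS=\"--v=%d --runtime-request-timeout=%s\"' | sudo tee /etc/default/kubelet",
		verbosity, requestTimeout) // e.g. 2 and "15m"; both values are placeholders
	return err
}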
@@ -1,87 +1,116 @@
# vHive Setup Scripts

- [vHive Setup Scripts](#vhive-setup-scripts)
We can remove this table of contents completely, since GitHub generates one automatically.
For fault tolerance purposes, Kubernetes control plane components can be replicated. This can be done by combining
It would still be better to have a description of what the options are and how they are used (specifically REGULAR, which appears throughout the docs).
Signed-off-by: Lazar Cvetković <[email protected]>
@cvetkovic Hi, what's the status of this PR? My point about the documentation still stands. We should have instructions on how to use the tools in this repo, not just references to general K8s help and a way to use it from Invitro. For example, what would be the steps to set up the HA control plane without all the other things we have in Invitro?
// Original Bash Scripts: scripts/cluster/create_multinode_cluster.sh

if err := CreateMasterKubeletService(); err != nil {
This command fixes #967 for the first master node. Without it, we still use the control IP on the master node when running on CloudLab, which can lead to suspension of the account if the traffic is big enough.
All other master nodes use CreateWorkerKubeletService, which has the fix for this problem but a different issue of its own: in the Firecracker setup, we use our CRI socket, which we don't need on controller nodes.
Please consider running CreateMasterKubeletService for all master nodes.
UPD: running it for each node is up to the Invitro setup scripts, not vHive's, but we need to make it possible from here. And there is not much difference between the master and worker commands, so I think we can combine them into a single command, setup_kubelet, with a sandbox type as an argument (see the sketch below). When setting up a node, we would always choose containerd for all master and backup nodes and change it only for regular nodes.
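A rough sketch of what that combined helper could look like; the function name, the flag set, and the Firecracker socket path are placeholders rather than anything this PR implements, and utils.ExecShellCmd is the repo helper used in the diffs above:

// Sketch: a single kubelet setup entry point, parameterized by sandbox type.
func SetupKubelet(sandbox string, nodeIP string) error {
	// Master and backup nodes always use containerd; only REGULAR (worker) nodes switch runtimes.
	criSocket := "unix:///run/containerd/containerd.sock"
	if sandbox == "firecracker" {
		criSocket = "unix:///path/to/firecracker-cri.sock" // placeholder, not a confirmed path
	}
	// Advertise the experiment-network IP so kubelet traffic stays off the control network (#967).
	_, err := utils.ExecShellCmd(
		"echo 'KUBELET_EXTRA_ARGS=\"--node-ip=%s --container-runtime-endpoint=%s\"' | sudo tee /etc/default/kubelet",
		nodeIP, criSocket)
	return err
}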
	--apiserver-advertise-address=%s \
	--cri-socket unix:///run/containerd/containerd.sock \

command := fmt.Sprintf(`sudo kubeadm init --v=%d \
We need to add --apiserver-advertise-address here and to the kubeadm join command for control-plane nodes. Right now, all the IPs that the load balancer uses are control-network IPs, which is bad (see the comment about kubelet creation).
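Something along these lines for the control-plane join, as a sketch only: the configs.Kube field names for the discovery hash and certificate key are assumed, and the flag set mirrors the standard kubeadm control-plane join rather than this PR's exact code.

// Sketch: join an additional control-plane node and advertise its 10.0.1.0/24 address.
command := fmt.Sprintf(`sudo kubeadm join %s:%s --token %s \
	--discovery-token-ca-cert-hash %s \
	--control-plane --certificate-key %s \
	--apiserver-advertise-address=%s \
	--cri-socket unix:///run/containerd/containerd.sock`,
	configs.Kube.ApiserverAdvertiseAddress,
	configs.Kube.ApiserverPort,
	configs.Kube.ApiserverToken,
	configs.Kube.ApiserverTokenHash, // assumed field name for the sha256 discovery hash
	configs.Kube.CertificateKey,     // assumed field name; produced by `kubeadm init --upload-certs`
	nodeIP)                          // the joining node's own experiment-network IP
_, err := utils.ExecShellCmd(command)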
shellOut, err := utils.ExecShellCmd("sed -n '/.*kubeadm join.*/p' < %s/masterNodeInfo | sed -n 's/.*join \\(.*\\):\\(\\S*\\) --token \\(\\S*\\).*/\\1 \\2 \\3/p'", configs.System.TmpDir)
- if !utils.CheckErrorWithMsg(err, "Failed to extract master node information from logs!\n") {
+ if !utils.CheckErrorWithMsg(err, "Failed to extract API Server address, port, and token from logs!\n") {
	return err
}
splittedOut := strings.Split(shellOut, " ")
configs.Kube.ApiserverAdvertiseAddress = splittedOut[0]
configs.Kube.ApiserverPort = splittedOut[1]
configs.Kube.ApiserverToken = splittedOut[2]

// API Server discovery token
shellOut, err = utils.ExecShellCmd("sed -n '/.*sha256:.*/p' < %s/masterNodeInfo | sed -n 's/.*\\(sha256:\\S*\\).*/\\1/p'", configs.System.TmpDir)
These commands that parse tmp/masterNodeInfo return duplicate lines of the same info, because the output of kubeadm init contains two kubeadm join lines (one for control-plane nodes and one for workers). Because of that, masterKey.yaml is malformed.
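One way to make the parsing robust, as a sketch: keep only the first matching line (both join commands carry the same address, port, and token), for example by adding head -n 1 to the existing pipeline. The same trick would apply to the sha256 extraction.

// Sketch: kubeadm init prints two "kubeadm join" lines (control plane and workers),
// so keep only the first match to avoid duplicated values leaking into masterKey.yaml.
shellOut, err := utils.ExecShellCmd(
	"sed -n '/.*kubeadm join.*/p' < %s/masterNodeInfo | head -n 1 | sed -n 's/.*join \\(.*\\):\\(\\S*\\) --token \\(\\S*\\).*/\\1 \\2 \\3/p'",
	configs.System.TmpDir)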
Hello, what is the status of this PR? Can we merge it now? @leokondrashov
NOTE: Only merge this PR in conjunction with vhive-serverless/invitro#379.
Modified the setup scripts so that the Kubernetes control plane can be deployed in high-availability mode, i.e., a mode where kube-apiserver, etcd, kube-scheduler, and kube-controller-manager are replicated for fault-tolerance purposes. The load balancer sits in front of the API Server replicas and is reachable through the virtual IP address 10.0.1.254. The setup favors CloudLab and assumes that all nodes of the cluster are part of 10.0.1.0/24.
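For reference, the HA initialization presumably boils down to pointing kubeadm at the load balancer's virtual IP and uploading the control-plane certificates; a sketch in the style of the setup scripts, where the port and the exact flag set are assumptions:

// Sketch: initialize the first control-plane node behind the load balancer VIP.
// 10.0.1.254 is the virtual IP from the description; 6443 is the default API Server port.
command := fmt.Sprintf(`sudo kubeadm init \
	--control-plane-endpoint=10.0.1.254:6443 \
	--upload-certs \
	--apiserver-advertise-address=%s \
	--cri-socket unix:///run/containerd/containerd.sock`, nodeIP)
_, err := utils.ExecShellCmd(command)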
I tested the setup on a 5-node xl170 Cloudlab cluster with 1 and 3 control plane replicas. The cluster deploys without any issues and seems to be working fine.
@ustiugov, @leokondrashov: Can you once more check to make sure everything is fine here? I see a lot of changes were made in the setup scripts that I might not be exactly aware of.
Visually, the setup looks as shown in the following image.