CNF-15663: Full DU profile example #313

Open
wants to merge 1 commit into
base: main

irinamihai
Collaborator

No description provided.

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Nov 15, 2024

@irinamihai: This pull request references CNF-15663 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot commented Nov 15, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from irinamihai. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@irinamihai
Collaborator Author

/hold cleanup

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 15, 2024
@irinamihai
Collaborator Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 18, 2024
type: string
cluster-log-fwd-outputs:
type: string
cluster-log-fwd-pipelines:
Collaborator

This should be a default

+1 -- or going a bit further I think this is static content that likely doesn't need to be part of the defaults either.

Collaborator Author

It has been agreed that this will be locked in the new version of the ClusterLogForwarder, currently WIP under OCPBUGS-44518, so it will be removed from this ClusterTemplate.

policyTemplateParameters:
description: policyTemplateSchema defines the available parameters for cluster configuration
properties:
cluster-log-fwd-filters:
Collaborator

This should be a default

Collaborator Author

By default, do you mean have them directly in the Policy Generator and not expose them in the policyTemplateParameters?

Collaborator

@browsell browsell Nov 19, 2024

No, these would be set in the default configmap vs being passed in by the client

As proposed above I think we should narrow this down to just the additional labels. One question I have is whether the user/orchestrator would add one or more labels which are cluster specific (like a higher level cluster identifier, etc)? In that case would the labels (or at least one label) need to be part of this schema?

Collaborator Author

The filters are also going to be partially locked in in the ClusterLogForwarder source-cr under OCPBUGS-44518. Yes, these labels will be set in the ClusterInstance defaults ConfigMap, but we also need a way for them to reach the ConfigMap used by the ACM PGs, so they need to also be included in the policyTemplate defaults ConfigMap.
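
To make the defaults-versus-client split discussed in this thread concrete, here is a minimal Python sketch. It assumes that values passed in by the client simply overlay the defaults ConfigMap on conflict, which nothing in this thread confirms. The function name and the placeholder values are hypothetical; only the key names come from the schema under review.

```python
def effective_policy_parameters(defaults, request_params):
    """Overlay client-supplied values on the defaults (assumed precedence)."""
    merged = dict(defaults)        # start from the defaults ConfigMap data
    merged.update(request_params)  # assumption: client values win on conflict
    return merged


# Placeholder values for illustration only; the keys appear in the reviewed schema.
defaults = {
    "sriov-fec-pfDriver": "<default-driver>",
    "sriov-fec-pciAddress": "<default-address>",
}
request_params = {"sriov-fec-pciAddress": "<site-specific-address>"}

effective = effective_policy_parameters(defaults, request_params)
print(effective)
```

Under that assumption, a key marked "default" by the reviewers would still resolve to a value when the client sends nothing for it.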

type: string
cluster-log-fwd-pipelines:
type: string
sriov-fec-bbDevConfig:
Collaborator

Default

type: string
sriov-fec-bbDevConfig:
type: string
sriov-fec-pciAddress:
Collaborator

default

type: string
sriov-fec-pciAddress:
type: string
sriov-fec-pfDriver:
Collaborator

default

# (e.g., "40m" for 40 minutes)
clusterConfigurationTimeout: "40m"
policytemplate-defaults: |
cluster-log-fwd-filters: '[{"name":"test-labels", "type": "openshiftLabels", "openshiftLabels": {"label1": "test1", "label2": "test2"}}]'
Collaborator

Not really filters, additional metadata labels

Do we want to narrow the templating down to just the additional labels, i.e., the user configures only the value for openshiftLabels?

type: string
hugepages-count:
type: string
machine-config-storage-source-1:
Collaborator

Remove, see comment above

type: string
machine-config-storage-source-1:
type: string
machine-config-storage-source-2:
Collaborator

Remove

type: string
hugepages-count:
type: string
machine-config-storage-source-1:
Collaborator

delete

type: string
machine-config-storage-source-1:
type: string
machine-config-storage-source-2:
Collaborator

delete

Member

@lack lack left a comment

Generally speaking, if this is something that will need to be kept in-sync with the cnf-features-deploy repo, perhaps it's worth engineering a way to automatically synchronize them or generate one from the other.

clusterConfigurationTimeout: "40m"
policytemplate-defaults: |
cluster-log-fwd-filters: '[{"name":"test-labels", "type": "openshiftLabels", "openshiftLabels": {"label1": "test1", "label2": "test2"}}]'
cluster-log-fwd-outputs: '[{"type":"kafka","name":"kafka-open", "kafka": {"url":"tcp://10.46.55.190:9092/test"}}]'
Member

Should we allow customization of all of this? Or just the Kafka url?

Collaborator

url only

Comment on lines +144 to +148
additionalKernelArgs:
- rcupdate.rcu_normal_after_boot=0
- vfio_pci.enable_sriov=1
- vfio_pci.disable_idle_d3=1
- efi=runtime
Member

We shouldn't override this section, but rely on the source-crs original value

+1
This is also missing module_blacklist=irdma.

Comment on lines +160 to +163
machineConfigPoolSelector:
pools.operator.machineconfiguration.openshift.io/master: ""
nodeSelector:
node-role.kubernetes.io/master: ''
Member

Suggested change
- machineConfigPoolSelector:
-   pools.operator.machineconfiguration.openshift.io/master: ""
- nodeSelector:
-   node-role.kubernetes.io/master: ''
+ machineConfigPoolSelector:
+   $patch: replace
+   pools.operator.machineconfiguration.openshift.io/master: ""
+ nodeSelector:
+   $patch: replace
+   node-role.kubernetes.io/master: ''

And then we don't need the SetSelector cr variant any more.

(Repeat for *-SetSelector.yaml elsewhere in this file!)

Comment on lines +181 to +182
phc2sysOpts: -a -r -n 24
ptp4lOpts: -2 -s --summary_interval -4
Member

We shouldn't override these; the source-crs has the right values.

Comment on lines +250 to +274
- path: source-crs/MachineConfigGeneric.yaml
complianceType: mustonlyhave # This is to update array entry as opposed to appending a new entry.
patches:
- metadata:
name: 02-master-workload-partitioning
spec:
config:
storage:
files:
- contents:
# crio cpuset config goes below. This value needs to be updated and matched with PerformanceProfile. Check the link for more info on the content.
source: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "machine-config-storage-source-1" hub}}'
mode: 420
overwrite: true
path: /etc/crio/crio.conf.d/01-workload-partitioning
user:
name: root
- contents:
# openshift cpuset config goes below. This value needs to be updated and matched with crio cpuset (array entry above this). Check the link for more info on the content.
source: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "machine-config-storage-source-2" hub}}'
mode: 420
overwrite: true
path: /etc/kubernetes/openshift-workload-pinning
user:
name: root
Member

I don't think any of this is needed any more, since the SiteConfig added cpuPartitioningMode: AllNodes in 4.14.

complianceType: musthave
patches:
- spec:
configDaemonNodeSelector:
Collaborator

Wouldn't this be in the source CR?

Member

The selector can't be, because depending on whether you're deploying SNO or MNO the source CR may need master or worker.

Collaborator

In this model, the cluster template would only be used for SNO; an MNO would have a different one.

For SNO we should be able to use either master or worker here since the node has both labels. If we use worker is it valid for all topologies?

summary=Configuration changes profile inherited from performance created tuned
include=openshift-node-performance-openshift-node-performance-profile
[bootloader]
cmdline_crash=-tsc=nowatchdog
Collaborator

Delete

Collaborator

Do we need to override the default profile? 🤔 The ztp git example is just using the default profile values.

Collaborator

No need to override the source cr here

include=openshift-node-performance-openshift-node-performance-profile
[bootloader]
cmdline_crash=-tsc=nowatchdog
cmdline_crash1=tsc=reliable
Collaborator

Delete

cmdline_crash=-tsc=nowatchdog
cmdline_crash1=tsc=reliable
[sysctl]
kernel.timer_migration=1
Collaborator

Delete

cmdline_crash1=tsc=reliable
[sysctl]
kernel.timer_migration=1
kernel.sysrq=1
Collaborator

Delete

[service]
service.stalld=start,enable
service.chronyd=stop,disable
# MACHINE CONFIG
Collaborator

Delete


For details about setting up the Git repo, please refer to the Gitops setup [README.md](./samples/git-setup/README.md).

**Note:** Make sure all the value used in hub templates in the PGs are exposed in the corresponding ClusterTemplate, under `spec.templateParameterSchema.policyTemplateParameters` and are present either in the `spec.templates.policyTemplateDefaults` ConfigMap or are specified through the ProvisioningRequest (`spec.templateParameters.policyTemplateParameters`).
Collaborator

nit: values
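
The requirement in the note could also be checked mechanically. The following is a rough Python sketch, not part of this PR: it pulls the ConfigMap keys out of hub-template lookups shaped like the ones in this PG and reports keys that are absent from the schema or have no value in either the defaults or the request. The function names and the sample inputs are hypothetical, and the regular expression only covers the exact `fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "<key>"` form quoted in this review.

```python
import re

# Captures the key argument of lookups such as:
# {{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "some-key" hub}}
HUB_KEY = re.compile(
    r'\{\{hub\s+fromConfigMap\s+""\s+\(printf\s+"%s-pg"\s+'
    r'\.ManagedClusterName\)\s+"([^"]+)"'
)


def hub_template_keys(pg_text):
    """Return the set of ConfigMap keys referenced by hub templates in a PG."""
    return set(HUB_KEY.findall(pg_text))


def check_keys(pg_text, schema_keys, default_keys, request_keys=()):
    """Return (keys missing from the schema, keys with no value anywhere)."""
    used = hub_template_keys(pg_text)
    not_in_schema = used - set(schema_keys)
    without_value = used - set(default_keys) - set(request_keys)
    return sorted(not_in_schema), sorted(without_value)


# Sample input built from two lookups quoted in this review.
pg_text = '''
outputs: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "cluster-log-fwd-outputs" | toLiteral hub}}'
filters: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "cluster-log-fwd-filters" | toLiteral hub}}'
'''
schema = {"cluster-log-fwd-outputs", "cluster-log-fwd-filters"}
defaults = {"cluster-log-fwd-filters"}  # hypothetical: outputs left without a default

missing_schema, missing_value = check_keys(pg_text, schema, defaults)
print(missing_schema, missing_value)
```

With these sample inputs, `cluster-log-fwd-outputs` is reported as having no value, because it is neither in the defaults nor supplied by a request.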

@@ -2,7 +2,8 @@ generators:
- sno-ran-du/sno-ran-du-pg-v4-Y-Z-v1.yaml
# This ACM PG is needed when the previous one has to be updated.
- sno-ran-du/sno-ran-du-pg-v4-Y-Z-v2.yaml

- sno-ran-du/sno-ran-du-pg-v4-Y-Z-v3.yaml
- sno-ran-du/sno-ran-du-pg-v4-Y-Z-v4-full-DU.yaml
Collaborator

Since this isn't really a progression from v3, but a whole new PG, would it be better to create new sno-ran-full-du directories for the cluster templates and policy templates and start them at v1?

Collaborator

agree

Comment on lines +33 to +38
- data:
config.yaml: |
alertmanagerMain:
enabled: false
telemeterClient:
enabled: false
Collaborator

By default, observability is not enabled. I think we should just use the default values from source-cr.

Collaborator

we should enable observability.

@@ -0,0 +1,376 @@
apiVersion: policy.open-cluster-management.io/v1
Collaborator

Is this DU profile based on OCP 4.17? I wonder if adding a comment to mention that might be helpful.

Collaborator

agree

name: redhat-operators
spec:
displayName: redhat-operators
image: registry.redhat.io/redhat/redhat-operator-index:v4.Y
Collaborator

Do we want to use a disconnected registry as the example?

Collaborator

yes

Comment on lines +18 to +19
ManagedCluster:
test-annotation: test

Is this intentional?

- name: clustertemplate-sample.v1.0.0-extramanifests
nodes:
- role: master
bootMode: UEFI

Reference is now UEFISecureBoot.

networkType: OVNKubernetes
sshPublicKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDTca4Qyu5AYBmZbSl74cNTKuNINJ7d+ceBRzKUrhHcQpMbl8UnAYhjh/ffTyVCsgwzm1RjTAm6/tPj9euEa+YX4U78Sx+ioLHmjDvACYsti4DekIR+opFwfIw+JTDXoyVv06lOPaTOa/vtgpe+gDEL364j47f3p9H/tGhsLmpjeG3DVAhbqSh3s0IHpd4OzF/r6g6mbPyHadvedkBZp/qeUX054Gc2QqJeg/s/eddPlQDJbmL8yRVkZu+SsFTOEOAtrdA3czeaEaA8s+aWP9PN3X539Ddw3qahyOSCXpCE2eJXPh8DJCBWVEcFFYgmIFVvCQ+o9cjEmIYg6drGGvRV
installConfigOverrides: '{"capabilities": {"baselineCapabilitySet": "None", "additionalEnabledCapabilities": ["NodeTuning", "OperatorLifecycleManager", "Ingress"]}}'
ignitionConfigOverride: '{"ignition": {"version": "3.2.0"}, "storage": {"files": [{"overwrite": true, "path": "/etc/containers/policy.json", "contents": {"source":"data:text/plain;base64,ewogICAgImRlZmF1bHQiOiBbCiAgICAgICAgewogICAgICAgICAgICAidHlwZSI6ICJpbnNlY3VyZUFjY2VwdEFueXRoaW5nIgogICAgICAgIH0KICAgIF0sCiAgICAidHJhbnNwb3J0cyI6CiAgICAgICAgewogICAgICAgICAgICAiZG9ja2VyLWRhZW1vbiI6CiAgICAgICAgICAgICAgICB7CiAgICAgICAgICAgICAgICAgICAgIiI6IFt7InR5cGUiOiJpbnNlY3VyZUFjY2VwdEFueXRoaW5nIn1dCiAgICAgICAgICAgICAgICB9CiAgICAgICAgfQp9Cgo="}}]}}'

For IBU there is a need for a separate partition for /var/lib/containers. Should this example set that up as well?

Collaborator

yes, good catch

Comment on lines +72 to +74
name: bond99
state: up
type: bond

Is there an intent to include ports eth0 and eth1 in this bond, or other ports in it?

Comment on lines +215 to +217
outputs: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "cluster-log-fwd-outputs" | toLiteral hub}}'
pipelines: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "cluster-log-fwd-pipelines" | toLiteral hub}}'
filters: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "cluster-log-fwd-filters" | toLiteral hub}}'

As noted above, we should use more fixed content in the source CR (i.e., be more opinionated) and allow the user to override only the URL and labels.

Comment on lines +218 to +219
serviceAccount:
name: collector

Already in source CR

Comment on lines +241 to +242
kernel.panic_on_rcu_stall=1
kernel.hung_task_panic=1

These are not in the reference. Is their addition intentional?

Collaborator

remove

kernel.hung_task_panic=1
[scheduler]
group.ice-ptp=0:f:10:*:ice-ptp.*
group.ice-gnss=0:f:10:*:ice-gnss.*

Missing:
group.ice-dplls=0:f:10:*:ice-dplls.*
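
Omissions like this one could be caught by a small drift check against the reference. Below is a minimal Python sketch under stated assumptions: the profile text is an abbreviated snippet assembled from lines quoted in this review (the section order is assumed), the expected set lists only the three group names mentioned in this thread, and the function name is hypothetical.

```python
def scheduler_groups(profile_text):
    """Collect group names from the [scheduler] section of a tuned profile."""
    groups = set()
    in_scheduler = False
    for raw in profile_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new section header starts; track whether it is [scheduler].
            in_scheduler = line == "[scheduler]"
        elif in_scheduler and line.startswith("group.") and "=" in line:
            name = line.split("=", 1)[0]
            groups.add(name[len("group."):])
    return groups


# Abbreviated example profile; not the full content from the PR.
example_profile = """[scheduler]
group.ice-ptp=0:f:10:*:ice-ptp.*
group.ice-gnss=0:f:10:*:ice-gnss.*
[service]
service.stalld=start,enable
"""

# Group names taken from this review thread only.
expected = {"ice-ptp", "ice-gnss", "ice-dplls"}
missing = sorted(expected - scheduler_groups(example_profile))
print(missing)
```

For the snippet above, the check reports `ice-dplls` as the only missing group.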

name: root
- name: v4-sriov-config-policy
manifests:
# SRIOV

For all of these we should not repeat any of the content already in the source CR.

orderPolicies: true
policies:
# REDUCE FOOTPRINT
- name: v4-footprint-policy

Is there a reason this is in its own policy? This creates more policies than are necessary. Consider combining with the next policy as "baseline config"

Collaborator

Actually, I think this can be included in the v4-config-policy.

path: /etc/kubernetes/openshift-workload-pinning
user:
name: root
- name: v4-sriov-config-policy
Collaborator

Could the SR-IOV configs be included in the v4-config-policy? I don't see a reason why they couldn't be. 🤔

enabled: false
ipv4:
enabled: false
name: bond99
Collaborator

Can we remove the bond? Bonding is going to be very rare for this use case.

@openshift-merge-robot
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 27, 2024
openshift-ci bot commented Nov 29, 2024

@irinamihai: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/markdownlint | 1629354 | link | true | /test markdownlint

Full PR test history. Your PR dashboard.

