Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VM Scale Management Steps #861

Merged
merged 1 commit into from
Aug 14, 2024

Conversation

ebattat
Copy link
Collaborator

@ebattat ebattat commented Aug 5, 2024

Type of change

Note: Fill x in []

  • bug
  • enhancement
  • documentation
  • dependencies

Description

[Chaos Testing] Adding VM Scale Management Steps:

  • Run VMs without deleting them before cluster upgrade (Add DELETE_ALL variable)
  • Verify VMs only after cluster upgrade (Add VERIFICATION_ONLY variable)
  • Add vm_ssh field to verify that SSH into VM is working properly

For security reasons, all pull requests need to be approved first before running any automated CI

@ebattat ebattat added the enhancement New feature or request label Aug 5, 2024
@ebattat ebattat requested a review from RobertKrawitz August 5, 2024 12:35
@ebattat ebattat self-assigned this Aug 5, 2024
@ebattat ebattat added the ok-to-test PR ok to test label Aug 5, 2024
:return:
"""
try:
vm_names = self._oc._get_all_vm_names()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to compare this against a list of expected VM names, if that's practical here. And you certainly want to make sure that there is at least one, or you'll simply be "verifying" a null case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You right, add MissingVMs error when no running VMs at all

if self._oc.wait_for_vm_ssh(vm_name=vm_name, node_ip=node_ip, vm_node_port=vm_node_port):
logger.info(f"Successfully ssh into VM: '{vm_name}' in Node: '{vm_node}' ")
return vm_node
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure you capture any stderr from the ssh command, so that when it fails you can at least have some hope of figuring out what went wrong (it might simply be a network problem or a credential problem).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I handled it in self._oc.wait_for_vm_ssh by raising VMStateTimeout

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that actually capture the stderr from the ssh command?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see the full code here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not appear to me that the ssh output gets reported if the ssh fails.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a custom error for it:
raise VMStateTimeout(vm_name=vm_name, state='ssh')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But does it report the actual error reported by ssh? I want that to be reported to make it easier to debug and to distinguish glitches from real problems. It doesn't look like VMStateTimeout captures the reason for the error (what ssh reports on stderr).

self._data_dict.update({'total_run_time': total_run_time})
if not self._verification_only:
total_run_time = self._get_bootstorm_vm_total_run_time()
self._data_dict.update({'total_run_time': total_run_time})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't be a bad idea to stick some distinctive value in there even in verification mode, such as -1, so that the schema is the same both ways.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

total_run_time will be empty when no input data

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, but that's about testing that total_run_time is handled correctly. Not that big of a deal, but think about it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer that it will be empty in ElasticSearch instead of -1, more simple to handle it.

@ebattat
Copy link
Collaborator Author

ebattat commented Aug 8, 2024

@RobertKrawitz, any more comments ?

Copy link
Member

@RobertKrawitz RobertKrawitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big thing I want is to capture stderr if ssh (or any other command) fails.

if self._oc.wait_for_vm_ssh(vm_name=vm_name, node_ip=node_ip, vm_node_port=vm_node_port):
logger.info(f"Successfully ssh into VM: '{vm_name}' in Node: '{vm_node}' ")
return vm_node
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not appear to me that the ssh output gets reported if the ssh fails.

@ebattat
Copy link
Collaborator Author

ebattat commented Aug 11, 2024

@RobertKrawitz, we dont need to get ssh output, we need to know if vm is accessible for ssh call or not

@RobertKrawitz
Copy link
Member

@RobertKrawitz, we dont need to get ssh output, we need to know if vm is accessible for ssh call or not

If the ssh fails, we want to know why. Connection timed out, connection reset by peer, authentication are different failures that ssh can report and I believe we do want to see what happened.

@ebattat
Copy link
Collaborator Author

ebattat commented Aug 12, 2024

If something will fail, I will raise it here

@RobertKrawitz
Copy link
Member

If something will fail, I will raise it here

As long as stderr from ssh gets saved, I don't care how you do it. I do want to be sure it's possible to see after the fact what happened.

@ebattat
Copy link
Collaborator Author

ebattat commented Aug 12, 2024

Sorry I didnt get your point.

@RobertKrawitz
Copy link
Member

Sorry I didnt get your point.

I want any stderr output from ssh to be saved, so that if there is a failure, we can look at it to try to figure out what happened. ssh can fail for any number of reasons which may have nothing to do with whether the VM is running: there might be an authentication failure, there might be a transient network problem, there might be any number of things. Simply "ssh to node failed" doesn't help us figure out what happened.

@ebattat
Copy link
Collaborator Author

ebattat commented Aug 12, 2024

I am focusing on the main flow right now and I dont want to focus in log collection in this pr only on verification steps.
I will open a separate PR for VM logs collection.

@RobertKrawitz
Copy link
Member

I am focusing on the main flow right now and I dont want to focus in log collection in this pr only on verification steps. I will open a separate PR for VM logs collection.

OK, but I think that it's important to collect information if verification fails. I will accept opening a PR for that, but I'd like you to open an issue against it prior to my approving this PR.

@ebattat
Copy link
Collaborator Author

ebattat commented Aug 12, 2024

Open issue regarding it: #862

Copy link
Member

@RobertKrawitz RobertKrawitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with issue now opened to log ssh failures in detail.

@ebattat ebattat merged commit 022dc27 into redhat-performance:main Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request ok-to-test PR ok to test
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants