Send email if reproducible build fails in the CI #7897

Open
wants to merge 1 commit into
base: master
from

ShahanaFarooqui
Collaborator

Changelog-None.

ShahanaFarooqui added this to the v25.02 milestone on Dec 3, 2024
@ShahanaFarooqui
Collaborator Author

ShahanaFarooqui commented Dec 3, 2024

Hi @s373nZ, please review this PR, which adds the functionality to send an email notification if the CI fails during any reproducible build step.

I have also updated the folder location for this script from /release to /repro just to avoid confusion with the release.yml/build-release.sh scripts.

I am also considering merging repro.yml with release.yml in the future, because repro.yml serves as a pre-stage for release.yml. I would appreciate your thoughts on that too.

Contributor

@s373nZ s373nZ left a comment

I generally receive an automatic email directly from GitHub whenever any CI run I triggered fails, so I was a little curious about the circumstances behind the requirement to send emails and did some digging. My guess is that the team is:

  • drowning in CI failure emails and most of the notifications are ignored or filtered
  • unclear who is receiving the notifications for scheduled workflows like the nightly repro builds

I found this documentation on workflow runs which states:

Notifications for scheduled workflows are sent to the user who initially created the workflow. If a different user updates the cron syntax in the workflow file, subsequent notifications will be sent to that user instead. If a scheduled workflow is disabled and then re-enabled, notifications will be sent to the user who re-enabled the workflow rather than the user who last modified the cron syntax.

Disabling and re-enabling the scheduled job (if you have permissions) or committing a change to the cron syntax could shift the automatic notifications to you. These rules seem a little flaky, so the requirement to have a solid solution captured in code is understandable, and it would benefit others in the community as well. Maybe Slack notifications could be an interesting alternative?

That said, your current approach looks pretty good to me. I would consider trying to consolidate the three different action-send-mail steps into one by making it a completely separate job, something like:

jobs:
  ubuntu: [...]

  failure-notify:
    needs: ubuntu
    runs-on: ubuntu-latest
    # Runs only when the ubuntu job (any matrix leg) has failed
    if: failure()
    steps:
      - uses: dawidd6/action-send-mail@v3
        ...

It might not work, esp. with the matrix build, but it could be worth a shot. Inspiration here.

> I have also updated the folder location for this script from /release to /repro just to avoid confusion with the release.yml/build-release.sh scripts.

Good idea! My first thought was to suggest changing cl-repro to cl-release in release.yml as well, but we have a logical dependency in build-release.sh.

> I am also considering merging repro.yml with release.yml in the future, because repro.yml serves as a pre-stage for release.yml. I would appreciate your thoughts on that too.

We could try to reuse the repro.yml steps using a reusable workflow or a composite action. I considered trying to do this at the outset of the release automation work, but I think it belongs in a separate PR.

That is, unless you are suggesting doing away with the nightly builds in favor of detecting dirty builds only during the release process. I think there is value in early nightly detection so the release captain isn't tasked with too much triage at the last minute, but merging the two workflows is reasonable and should be possible.

Overall, LGTM pending your feedback re: the email step consolidation and the SMTP config.

.github/workflows/repro.yml (outdated, resolved)
ShahanaFarooqui force-pushed the repro-step-failed-email branch 2 times, most recently from 2c84244 to d1ea50e on December 3, 2024 21:40
@ShahanaFarooqui
Collaborator Author

@s373nZ

> That said, your current approach looks pretty solid to me.

Yes, the goal is to send a customised email so that it stands out from the others and is not overlooked.

> I would suggest consolidating the three different action-send-mail steps into one, by making it a separate job.

Thanks for pushing me to avoid being lazy 😄! I was not happy about repeating the step, but wanted to capture the details of the failed step as well. It took a little time, but the email and workflow are much cleaner now. I ended up merging them into a single step at the end.

> We could try reusing the repro.yml steps by utilizing a reusable workflow or a composite action.

I would prefer to keep everything in one workflow. I plan to run the repro step on a scheduled basis, while the other steps (including the repro) will execute when a tag is pushed.

@s373nZ
Contributor

s373nZ commented Dec 3, 2024

@ShahanaFarooqui It just occurred to me that this line (in all the Ubuntu Dockerfiles) might cause a problem with changing the folder location to /repro:

&& cp *.xz /repo/release/

Since the files are reused in both the repro build process and the release process, it might cause an error.

@ShahanaFarooqui
Collaborator Author

ShahanaFarooqui commented Dec 3, 2024

> It just occurred to me that this line (in all the Ubuntu Dockerfiles) might cause a problem with changing the folder location to /repro:

  • Should this be an issue, considering that the Dockerfile.noble is used exclusively to build the cl-repro-noble image, and the next step only uses this newly created image? Isn't the repro folder mainly responsible for creating the version.txt and git.log files and changing user permissions?

@s373nZ
Contributor

s373nZ commented Dec 3, 2024

> Should this be an issue, considering that the Dockerfile.noble is used exclusively to build the cl-repro-noble image, and the next step only uses this newly created image? Isn't the repro folder mainly responsible for creating the version.txt and git.log files and changing user permissions?

The CMD statement in the Dockerfile sets the default command that is executed when you docker run the built image, so it is what actually builds the release here:

docker run --name cl-build -v $GITHUB_WORKSPACE:/repo -e FORCE_MTIME=$(date +%F) -t cl-repro-${{ matrix.version }}

That line is copying the repro build archive to the ./release directory, and the rest of the repro.yml steps look there to parse the filename here:

releasefile=$(ls release/clightning-*)

Also, IIRC I needed to create the ./release directory in the CI because it doesn't exist after a fresh checkout.
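
As a rough sketch of the lookup described above, the filename parsing could be wrapped in a small helper that fails loudly when the archive is missing or ambiguous. The function name `find_release_archive` is hypothetical and not part of repro.yml; only the `release/clightning-*` pattern comes from the workflow.

```shell
# Hypothetical helper (not from repro.yml): print the single clightning-*
# file found in a directory, or fail if there are zero or several.
find_release_archive() {
  dir="${1:-release}"
  count=0
  found=""
  for f in "$dir"/clightning-*; do
    # An unmatched glob stays literal, so check that the path really exists
    if [ -e "$f" ]; then
      count=$((count + 1))
      found="$f"
    fi
  done
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one clightning-* file in $dir, found $count" >&2
    return 1
  fi
  printf '%s\n' "$found"
}
```

In a workflow step this might be used as `mkdir -p release` before the docker run, then `releasefile=$(find_release_archive release)` afterwards, so a wrong output directory surfaces as a clear error instead of an empty variable.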

One initial idea to get around this might be to try adding an ARG to the Dockerfile which defaults to release but pass in repro in this case. Probably the thing to do is run the action in a test branch, with the workflow set to trigger on pushes to it (on.push.branches: <test-branch-name>), just to be sure.

Hope this makes sense. LMK if it doesn't, or I'm missing something. I can help or chime back in tomorrow.

ShahanaFarooqui force-pushed the repro-step-failed-email branch 4 times, most recently from cc4fbaf to 507008f on December 4, 2024 00:51
@ShahanaFarooqui
Collaborator Author

ShahanaFarooqui commented Dec 4, 2024

The action still completes successfully with the /repro directory. However, I have reverted the change and switched back to using the /release folder for now. This will not be a concern once we merge both workflows anyway.

I also added the step Upload release artifact for easier debugging.

ShahanaFarooqui force-pushed the repro-step-failed-email branch from 507008f to ef485cf on December 4, 2024 01:18
@s373nZ
Contributor

s373nZ commented Dec 4, 2024

@ShahanaFarooqui I'm curious how it completed successfully w/o the release directory, but I can't see the build output from the run in the CI logs anymore. By grouping the commands and redirecting the output to log files, we would be sacrificing centralized observability in the GitHub Actions CI interface (for both successes and failures) in order to gain more context in the error email.

IMHO, the appropriate solution is to leave the task commands as they were previously so we still get the log output in the UI, and just have a simpler email with less context that reports there was an error and provides a link to the CI output. The recipients can view the log output from GitHub Actions in the case of an error, and others who are not on the DISTRIBUTION_LIST can also see the workflow output in cases of both success and failure. What do you think?
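
A minimal sketch of what such a link-only email body might contain, assuming the step runs inside GitHub Actions, where `GITHUB_SERVER_URL`, `GITHUB_REPOSITORY` and `GITHUB_RUN_ID` are set by default. The function names and the message wording are hypothetical, not taken from the PR.

```shell
# Build the URL of the current workflow run from the default Actions
# environment variables. Function names here are illustrative only.
run_url() {
  printf '%s/%s/actions/runs/%s\n' \
    "${GITHUB_SERVER_URL:-https://github.com}" \
    "${GITHUB_REPOSITORY:?GITHUB_REPOSITORY is not set}" \
    "${GITHUB_RUN_ID:?GITHUB_RUN_ID is not set}"
}

# Compose a short failure message that only points at the CI logs.
# $1: the matrix version that failed (hypothetical parameter)
failure_email_body() {
  printf 'Reproducible build failed for %s.\nLogs: %s\n' "$1" "$(run_url)"
}
```

Keeping the body this small leaves the full build output in the Actions UI, where people outside the distribution list can read it too.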

.github/workflows/repro.yml (outdated, resolved)
ShahanaFarooqui force-pushed the repro-step-failed-email branch from ef485cf to f28587b on December 4, 2024 20:25
@ShahanaFarooqui
Collaborator Author

> IMHO, the appropriate solution is to leave the task commands as they were previously so we still get the log output in the UI, and just have a simpler email with less context that reports there was an error and provides a link to the CI output. The recipients can view the log output from GitHub Actions in the case of an error, and others who are not on the DISTRIBUTION_LIST can also see the workflow output in cases of both success and failure. What do you think?

Agreed. I removed the error-capturing code, as it added more complexity than it was worth. I also tried using the tee command to write to both standard output and the file, but error handling at each step still added unnecessary complexity. It is better to rely on the receiver to check the details directly in the action itself.

Posting the tee error handling for future reference though:

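# Without pipefail the pipeline reports tee's exit status, so a failing import would go unnoticed
set -o pipefail
sudo tar -C ${{ matrix.version }} -c . | docker import - ${{ matrix.version }} 2>&1 | tee command.log || exit_code=$?
if [ -n "$exit_code" ]; then
  # Store the captured log as a multiline environment variable for later steps
  echo "ERROR<<EOF" >> "$GITHUB_ENV"
  cat command.log >> "$GITHUB_ENV"
  echo "EOF" >> "$GITHUB_ENV"
  exit 1
fi

and in the email body:

Logs: <pre>${{ env.ERROR }}</pre><br/>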
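
For comparison, here is a sketch of a different technique from the tee pipeline above: run the command with its output redirected to a file, remember its exit status, and then print the file so the output still reaches the CI log. It avoids the question of which pipeline stage's status is reported, at the cost of showing the output only after the command finishes. The helper name `run_logged` is hypothetical.

```shell
# Hypothetical helper: run a command, save stdout and stderr to a log file,
# replay the log to stdout, and return the command's own exit status.
# $1: log file path; remaining arguments: the command to run
run_logged() {
  log_file="$1"
  shift
  status=0
  "$@" >"$log_file" 2>&1 || status=$?
  # Replay the captured output so it still appears in the workflow log
  cat "$log_file"
  return "$status"
}
```

A step could then call, for example, `run_logged command.log docker import ...` and decide what to do with `command.log` only when the call fails.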

Contributor

@s373nZ s373nZ left a comment

Very nice! LGTM :)

ACK f28587b

@s373nZ
Contributor

s373nZ commented Dec 4, 2024

One final observation - it looks like if all three repro builds failed, then three separate emails would be sent because the failure email step is a part of the matrix job, right? You could try to consolidate it down into one email by making that step a separate job as per #7897 (review) (using needs: ubuntu) but it might not be straightforward to add the failing step name directly into the email body.

If having multiple emails per failed run is fine with you, the current code still LGTM 👍

@ShahanaFarooqui
Collaborator Author

> You could try to consolidate it down into one email by making that step a separate job

For now, I would prefer to keep them separate since I am the only one on the distribution list :D. I will consider merging them if their frequency seems unnecessarily high.

> but it might not be straightforward to add the failing step name directly into the email body

Capturing the failing step name should not be a big issue, as we can set it as a step output instead.
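
One way this might look, as a sketch only: a step appends a `name=value` line to the file that GitHub Actions exposes as `GITHUB_OUTPUT`, and a later step reads it through `steps.<id>.outputs.<name>`. The helper name `set_step_output` and the output name `failed_step` are assumptions for illustration and do not come from the PR.

```shell
# Hypothetical helper: record a single-line step output.
# $1: output name, $2: value (must not contain newlines)
set_step_output() {
  printf '%s=%s\n' "$1" "$2" >> "${GITHUB_OUTPUT:?GITHUB_OUTPUT is not set}"
}
```

For instance, a build step could call `set_step_output failed_step "Build image"` right before exiting with an error, and the email step could include that value in its body.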


2 participants