Add builder for translations GPU images using multiuser generic-worker #142

Closed

bhearsum
Contributor

Over in mozilla/translations#466 I'm working on adding support for automatic uploads of artifacts with one of our scriptworkers. Using scriptworkers requires enabling chain of trust. While testing this, I discovered that the simple engine doesn't support chain of trust, which means we'll need to move GPU workers to the multiuser engine.

This patch is a first shot at something that might work. I based it on gw_fxci_gcp_l1_gui.yaml with cuda, papertrail, and translations requirements added. I was able to build it with papertrail disabled (I don't have those secrets), but I'm unable to test my own built images properly, so it's difficult to be certain this will work. Feel free to throw this out if there's a different configuration that's preferred.

@aerickson
Member

Do we want L3 only or L1 and L3 versions of this image?

@bhearsum
Contributor Author

> Do we want L3 only or L1 and L3 versions of this image?

Just L1 for now is fine. (I do expect to be requesting L2 or L3 images in the near future, but I'm not quite ready for them yet, and not sure which we'll end up using.)

@aerickson
Member

Building the following at 71d7783:

monopacker build gw_translations_multiengine --secrets_file real_secrets_l1.yaml --packer-args '-on-error=ask' produced gw-translations-multiengine-googlecompute-2024-06-25t00-40-27z.
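
The resulting image name appears to combine the builder name, the Packer builder type, and a build timestamp. A minimal sketch of splitting such a name, assuming the `-googlecompute-` separator and a trailing `z` that denotes UTC (both are inferred from this one name, not from monopacker documentation; `parse_image_name` is a hypothetical helper):

```python
from datetime import datetime, timezone


def parse_image_name(image_name):
    """Split an image name into (builder prefix, build time).

    Assumes the layout '<prefix>-googlecompute-<YYYY-MM-DD>t<HH-MM-SS>z',
    which is inferred from the single name quoted in this thread.
    """
    prefix, sep, stamp = image_name.partition("-googlecompute-")
    if not sep:
        raise ValueError(f"unexpected image name layout: {image_name!r}")
    # Assumption: the trailing 'z' marks the timestamp as UTC.
    built_at = datetime.strptime(stamp, "%Y-%m-%dt%H-%M-%Sz").replace(
        tzinfo=timezone.utc
    )
    return prefix, built_at


prefix, built_at = parse_image_name(
    "gw-translations-multiengine-googlecompute-2024-06-25t00-40-27z"
)
print(prefix, built_at.isoformat())
```

This can help when matching a worker pool's configured image against the commit it was built from.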

@bhearsum
Contributor Author

> gw-translations-multiengine-googlecompute-2024-06-25t00-40-27z

Thank you! I'm going to test this out this week!

@bhearsum
Contributor Author

It looks like these are spawning, but not claiming jobs. Here's what I saw in worker manager with one task pending:
(screenshot of the worker manager view)

And there seemed to be many errors logged in GCP:

2024-06-25 15:29:43.761
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig systemd[1]: Starting Manage, Install and Generate Color Profiles...
2024-06-25 15:29:43.788
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig gsd-media-keys[1938]: Failed to grab accelerator for keybinding settings:hibernate
2024-06-25 15:29:43.788
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig gsd-media-keys[1938]: Failed to grab accelerator for keybinding settings:playback-repeat
2024-06-25 15:29:43.803
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig spice-vdagent[2096]: vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0 does not exist, exiting
2024-06-25 15:29:43.806
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig /usr/libexec/gdm-wayland-session[1820]: dbus-daemon[1820]: [session uid=132 pid=1820] Activating service name='org.gnome.ScreenSaver' requested by ':1.26' (uid=132 pid=1945 comm="/usr/libexec/gsd-power " label="unconfined")
2024-06-25 15:29:43.807
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig gnome-session-binary[1821]: Entering running state
2024-06-25 15:29:43.832
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig colord[2091]: failed to get edid data: EDID length is too small
2024-06-25 15:29:43.859
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig /usr/libexec/gdm-wayland-session[1820]: dbus-daemon[1820]: [session uid=132 pid=1820] Successfully activated service 'org.gnome.ScreenSaver'
2024-06-25 15:29:43.861
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig dbus-daemon[740]: [system] Successfully activated service 'org.freedesktop.ColorManager'
2024-06-25 15:29:43.861
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig systemd[1]: Started Manage, Install and Generate Color Profiles.
2024-06-25 15:29:43.876
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig xbrlapi.desktop[2098]: openConnection: connect: No such file or directory
2024-06-25 15:29:43.876
Jun 25 14:29:43 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig xbrlapi.desktop[2098]: cannot connect to braille devices daemon brltty at :0
2024-06-25 15:29:44.189
Jun 25 14:29:44 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig gsd-color[1931]: failed to get edid: unable to get EDID for output
2024-06-25 15:29:44.196
Jun 25 14:29:44 translations-1-b-linux-v100-gpu-4-bug1-okwdinv2qc2sonjc8ljsig gsd-color[1931]: unable to get EDID for xrandr-Virtual-1: unable to get EDID for output
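
A log dump like the one above is easier to triage when grouped by the process that emitted each line. The following is a rough sketch, assuming the lines follow the usual syslog layout (`Mon DD HH:MM:SS host process[pid]: message`); `count_by_process` and `messages_mentioning` are hypothetical helpers, and the sample uses a shortened host name:

```python
import re
from collections import Counter

# Matches syslog-style lines such as:
#   "Jun 25 14:29:43 <host> colord[2091]: failed to get edid data: ..."
# Bare timestamp lines from the log viewer (e.g. "2024-06-25 15:29:43.788")
# do not match and are skipped.
LINE_RE = re.compile(
    r"^\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} "
    r"(?P<host>\S+) (?P<proc>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?: (?P<msg>.*)$"
)


def count_by_process(lines):
    """Count log lines per emitting process name."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("proc")] += 1
    return counts


def messages_mentioning(lines, keyword):
    """Return (process, message) pairs whose message contains keyword."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and keyword.lower() in match.group("msg").lower():
            hits.append((match.group("proc"), match.group("msg")))
    return hits


sample = [
    "2024-06-25 15:29:43.788",
    "Jun 25 14:29:43 host gsd-media-keys[1938]: Failed to grab accelerator for keybinding settings:hibernate",
    "Jun 25 14:29:43 host colord[2091]: failed to get edid data: EDID length is too small",
    "Jun 25 14:29:44 host gsd-color[1931]: failed to get edid: unable to get EDID for output",
]
print(count_by_process(sample).most_common())
print(messages_mentioning(sample, "edid"))
```

Run against the full excerpt, this kind of grouping shows that every quoted line comes from a desktop-session service; none of them is emitted by the worker process itself.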

@aerickson
Member

Hmm, I know we do some EDID tweaks in the ubuntu-jammy-from-community-gui scripts. I wonder if the CUDA install is messing with something.

We could try reordering the scripts.

@bhearsum
Contributor Author

That one is still hitting the same errors :(:

03:00:10.181
Jun 26 02:00:10 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow gnome-shell[1913]: (../clutter/clutter/clutter-frame-clock.c:332):clutter_frame_clock_notify_presented: code should not be reached
2024-06-26 03:00:10.932
Jun 26 02:00:10 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow gnome-shell[1913]: (../clutter/clutter/clutter-frame-clock.c:332):clutter_frame_clock_notify_presented: code should not be reached
2024-06-26 03:00:10.932
Jun 26 02:00:10 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow dbus-daemon[747]: [system] Activating via systemd: service name='org.freedesktop.ColorManager' unit='colord.service' requested by ':1.76' (uid=132 pid=1998 comm="/usr/libexec/gsd-color " label="unconfined")
2024-06-26 03:00:10.990
Jun 26 02:00:10 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow spice-vdagent[2159]: vdagent virtio channel /dev/virtio-ports/com.redhat.spice.0 does not exist, exiting
2024-06-26 03:00:10.990
Jun 26 02:00:10 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow systemd[1]: Starting Manage, Install and Generate Color Profiles...
2024-06-26 03:00:10.990
Jun 26 02:00:10 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow gnome-session-binary[1890]: Entering running state
2024-06-26 03:00:10.990
Jun 26 02:00:10 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow gsd-media-keys[2005]: Failed to grab accelerator for keybinding settings:playback-repeat
2024-06-26 03:00:10.990
Jun 26 02:00:10 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow gsd-media-keys[2005]: Failed to grab accelerator for keybinding settings:hibernate
2024-06-26 03:00:11.004
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow /usr/libexec/gdm-wayland-session[1889]: dbus-daemon[1889]: [session uid=132 pid=1889] Activating service name='org.gnome.ScreenSaver' requested by ':1.24' (uid=132 pid=2010 comm="/usr/libexec/gsd-power " label="unconfined")
2024-06-26 03:00:11.038
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow xbrlapi.desktop[2165]: openConnection: connect: No such file or directory
2024-06-26 03:00:11.038
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow xbrlapi.desktop[2165]: cannot connect to braille devices daemon brltty at :0
2024-06-26 03:00:11.057
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow /usr/libexec/gdm-wayland-session[1889]: dbus-daemon[1889]: [session uid=132 pid=1889] Successfully activated service 'org.gnome.ScreenSaver'
2024-06-26 03:00:11.064
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow colord[2164]: failed to get edid data: EDID length is too small
2024-06-26 03:00:11.069
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow dbus-daemon[747]: [system] Successfully activated service 'org.freedesktop.ColorManager'
2024-06-26 03:00:11.069
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow systemd[1]: Started Manage, Install and Generate Color Profiles.
2024-06-26 03:00:11.495
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow gsd-color[1998]: failed to get edid: unable to get EDID for output
2024-06-26 03:00:11.503
Jun 26 02:00:11 translations-1-b-linux-v100-gpu-4-bug1-czq4tjwjssgrq31cem-uow gsd-color[1998]: unable to get EDID for xrandr-Virtual-1: unable to get EDID for output

@aerickson
Member

aerickson commented Jun 26, 2024

Hm, sounds like the NVIDIA driver is interacting with Wayland in some way (and in a more complex way than I initially thought).

We could wait for a generic-worker multiuser build that doesn't require a GUI (taskcluster/taskcluster#6786, er, taskcluster/taskcluster#4595), or we could investigate further.

I think pmoore is working on an Ubuntu 24.04 config in community, but I'm not sure that will solve the Wayland/NVIDIA issues. Someone recently commented that everyone with an NVIDIA GPU is still using X11; I'm not sure if that's true.

@bhearsum
Contributor Author

Do we have any idea on a timeline for the non-GUI multi engine generic worker? (I'd be happy to help guinea pig it in Translations.)

@aerickson
Member

No, no timeline yet. We're going to discuss at our next RelSRE/Taskcluster meeting (I think we're going to try to invite Releng to this next one).

@bhearsum
Contributor Author

> No, no timeline yet. We're going to discuss at our next RelSRE/Taskcluster meeting (I think we're going to try to invite Releng to this next one).

OK! Maybe we can figure out next steps here after that?

@aerickson
Member

Headless is out. We should revisit this.

bhearsum force-pushed the translations-multi branch from 71d7783 to 9fa8d33 (October 7, 2024 15:51)
@bhearsum
Contributor Author

bhearsum commented Oct 7, 2024

I'm ready to give it another go anytime!

@bhearsum
Contributor Author

I've updated this with the latest Taskcluster version & a fix for the kernel uninstall (the ubuntu version number part changed...).

bhearsum requested a review from aerickson (November 20, 2024 19:23)
@aerickson
Member

Getting an error:

    googlecompute.gw_translations_multiengine: Removing linux-image-6.8.0-1018-gcp
    googlecompute.gw_translations_multiengine: -----------------------------------
    googlecompute.gw_translations_multiengine:
    googlecompute.gw_translations_multiengine: You are running a kernel (version 6.8.0-1018-gcp) and attempting to remove the
    googlecompute.gw_translations_multiengine: same version.
    googlecompute.gw_translations_multiengine:
    googlecompute.gw_translations_multiengine: This can make the system unbootable as it will remove
    googlecompute.gw_translations_multiengine: /boot/vmlinuz-6.8.0-1018-gcp and all modules under the directory
    googlecompute.gw_translations_multiengine: /lib/modules/6.8.0-1018-gcp. This can only be fixed with a copy of the kernel
    googlecompute.gw_translations_multiengine: image and the corresponding modules.
    googlecompute.gw_translations_multiengine:
    googlecompute.gw_translations_multiengine: It is highly recommended to abort the kernel removal unless you are prepared to
    googlecompute.gw_translations_multiengine: fix the system after removal.
    googlecompute.gw_translations_multiengine:

The build just seems to hang after that. I'll debug some more.
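
The warning above comes from trying to remove the kernel the build VM is currently booted into. The hang is consistent with the removal step waiting on a confirmation prompt, though nothing in this thread confirms that. One way to sidestep the situation is to exclude the running kernel from the removal list. A minimal sketch, assuming kernel packages are named `linux-image-<release>`; `removable_kernel_packages` is a hypothetical helper, not part of monopacker, and the second kernel version in the example is made up for illustration:

```python
import platform
import re

# Versioned kernel image packages, e.g. "linux-image-6.8.0-1018-gcp".
# Meta packages such as "linux-image-gcp" have no leading digit and are skipped.
KERNEL_PKG_RE = re.compile(r"^linux-image-(?P<release>\d[\w.+~-]*)$")


def removable_kernel_packages(installed_packages, running_release=None):
    """Return versioned kernel image packages that are safe to remove.

    The kernel matching running_release (default: the current `uname -r`
    value, via platform.release()) is always kept.
    """
    if running_release is None:
        running_release = platform.release()
    removable = []
    for pkg in installed_packages:
        match = KERNEL_PKG_RE.match(pkg)
        if match is None:
            continue  # not a versioned kernel image package
        if match.group("release") == running_release:
            continue  # never remove the kernel we are booted into
        removable.append(pkg)
    return removable


installed = [
    "linux-image-6.8.0-1018-gcp",  # running kernel from the log above
    "linux-image-6.5.0-1025-gcp",  # hypothetical older kernel
    "linux-image-gcp",
    "cuda-toolkit",
]
print(removable_kernel_packages(installed, running_release="6.8.0-1018-gcp"))
```

Comparing against the running release rather than a hard-coded version string also avoids breakage when the Ubuntu kernel version changes between base images.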

bhearsum force-pushed the translations-multi branch 2 times, most recently from 0bb51c4 to 3b7e5b4 (November 21, 2024 15:02)
@bhearsum
Contributor Author

bhearsum commented Dec 2, 2024

We're not going to do this; translations will switch to the ubuntu 24.04 images in https://github.com/mozilla-platform-ops/worker-images instead.

bhearsum closed this (Dec 2, 2024)