use dnsmasq for translations generic worker #149

Open. Wants to merge 1 commit into main.

bhearsum
Contributor

Hopefully a fix for mozilla/translations#549

I haven't had a chance to test out an image build with this myself yet.

@bhearsum
Contributor Author

Managed to build with this today. I got a working image, but it's causing seemingly harmless DNS resolution issues in later steps of the build:

==> googlecompute.gw_translations_gcp: Provisioning with shell script: /tmp/packer-shell1510919229
==> googlecompute.gw_translations_gcp: sudo: unable to resolve host packer-67080b48-3d4a-3126-f2c5-f6d432397504: Name or service not known

The image still builds and works fine, but DNS resolution that relies on the search domain being set properly clearly does not work with this patch. Ideally, whatever DHCP client is in use would write this information out somewhere, and we could point dnsmasq at it, but I'm not sure which component does that.
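
The sudo warning above comes down to the machine's own hostname not resolving. A minimal Python sketch for checking that from a provisioning step follows. `hostname_resolves` is a hypothetical helper, not part of monopacker or this patch, and the `resolver` parameter exists only so the check can be exercised without a network.

```python
import socket


def hostname_resolves(hostname=None, resolver=socket.gethostbyname):
    """Return True if `hostname` (default: this machine's name) resolves.

    Hypothetical helper for illustration; `resolver` is injectable so the
    function can be tested without touching real DNS.
    """
    name = hostname if hostname is not None else socket.gethostname()
    try:
        resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True


if __name__ == "__main__":
    name = socket.gethostname()
    status = "resolves" if hostname_resolves(name) else "does NOT resolve"
    print(f"{name}: {status}")
```

If this reports that the bare hostname does not resolve while fully qualified names do, that would be consistent with the search domain not being applied.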

@bhearsum
Contributor Author

@aerickson - I realize this is most likely not landable, but I'd appreciate any thoughts or insight you have here.

@bhearsum bhearsum requested a review from aerickson October 10, 2024 17:56
@aerickson
Member

I think it's just the packer ssh session that is still trying to use the systemd resolver (new processes are fine).

If it builds fine and name resolutions work I think it's fine (we could add a note in the dnsmasq script that the errors are non-fatal).
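
One way to see which resolver a given process is pointed at is to inspect the resolver configuration it reads. The sketch below parses resolv.conf-style text and reports the nameservers and search domains. It assumes the standard `nameserver` and `search` keywords; `127.0.0.53` is the systemd-resolved stub address. The function names are hypothetical, and nothing here is taken from the patch itself.

```python
SYSTEMD_STUB = "127.0.0.53"  # address of the systemd-resolved stub listener


def parse_resolv_conf(text):
    """Collect nameserver and search entries from resolv.conf-style text."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blanks and comment lines
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            nameservers.append(values[0])
        elif keyword == "search":
            search = values  # with several search lines, the last one is used
    return {"nameservers": nameservers, "search": search}


def uses_systemd_stub(text):
    """Return True if the configuration points at the systemd-resolved stub."""
    return SYSTEMD_STUB in parse_resolv_conf(text)["nameservers"]


if __name__ == "__main__":
    with open("/etc/resolv.conf") as handle:
        contents = handle.read()
    print(parse_resolv_conf(contents))
    print("systemd stub in use:", uses_systemd_stub(contents))
```

An empty `search` list alongside a non-stub nameserver would match the symptom described earlier, where names that depend on the search domain fail to resolve.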

Member

@aerickson aerickson left a comment

LMK if you'd like me to make a build. I noticed gw-translations-gcp-googlecompute-2024-10-10t20-47-51z that you might have created when testing.

@bhearsum
Contributor Author

> LMK if you'd like me to make a build. I noticed gw-translations-gcp-googlecompute-2024-10-10t20-47-51z that you might have created when testing.

In the taskcluster-imaging project? I didn't know I had permissions to that one... 😬.

We can try a build, yeah. I can give it a quick sanity check, but I don't expect we'll know whether the issue is truly fixed until it has been in production for a while.

@aerickson
Member

> > LMK if you'd like me to make a build. I noticed gw-translations-gcp-googlecompute-2024-10-10t20-47-51z that you might have created when testing.
>
> In the taskcluster-imaging project? I didn't know I had permissions to that one... 😬.

You're right, you only have roles/compute.imageUser (which can't create images). It must have been me and UTC threw me. I was close to finishing a build and ctrl-c'd, but it must have already made the image.

> We can try a build, yeah. I can give it a quick sanity check, but I don't expect we'll know whether the issue is truly fixed until it has been in production for a while.

Sounds good. Will do a build. (Not sure if the image above is good, so I'll make a new one.)

@aerickson
Member

Built gw-translations-gcp-googlecompute-2024-10-11t18-26-57z with `monopacker build gw_translations_gcp --secrets_file real_secrets_l1.yaml --packer-args '-on-error=ask'`.

SBOM is at https://github.com/mozilla-platform-ops/monopacker-sboms/blob/main/gw_translations_gcp/gw-translations-gcp-googlecompute-2024-10-11t18-26-57z.md.

@bhearsum
Contributor Author

This image seems to be causing issues for reasons unrelated to dnsmasq. We're getting errors from marian like:

[task 2024-10-15T15:36:51.296Z]
[task 2024-10-15T15:36:51.296Z] Error: Curand error 203 - /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74: curandCreateGenerator(&generator_, CURAND_RNG_PSEUDO_DEFAULT)
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z] Error: Aborted from marian::CurandRandomGenerator::CurandRandomGenerator(size_t, marian::DeviceId) in /builds/worker/fetches/marian-source/src/tensors/rand.cpp:74
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z] [CALL STACK]
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z]    marian::CurandRandomGenerator::  CurandRandomGenerator  (unsigned long,  marian::DeviceId) + 0x83f
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z]    marian::  createRandomGenerator  (unsigned long,  marian::DeviceId) + 0x69
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z]    marian::  BackendByDeviceId  (marian::DeviceId,  unsigned long) + 0xa0
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z]    marian::ExpressionGraph::  setDevice  (marian::DeviceId,  std::shared_ptr<marian::Device>) + 0x80
[task 2024-10-15T15:36:51.296Z] 
[task 2024-10-15T15:36:51.296Z]    marian::GraphGroup::  initGraphsAndOpts  ()        + 0x1e5

marian-nmt/marian-dev#666 suggests that this is a mismatch between the CUDA version that Marian was compiled against and the one being used at runtime. The SBOM seems to confirm this: we have 12.6.2 installed on the image, but we compiled Marian against 12.1.0: https://github.com/mozilla/firefox-translations-training/blob/d1d1efc441bfa9fc0f3e2c176f778197d16899df/taskcluster/kinds/fetch/toolchains.yml#L45-L52. I'm going to try bumping the latter.
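
A version comparison like the one above can be sketched in a few lines of Python. This is only an illustration: `cuda_mismatch` is a hypothetical helper, and treating any major or minor difference as worth flagging is an assumption, not an established rule for when cuRAND fails.

```python
def parse_version(text):
    """Turn a dotted version string such as '12.6.2' into (12, 6, 2)."""
    return tuple(int(part) for part in text.strip().split("."))


def cuda_mismatch(compiled, runtime):
    """Describe the difference between two CUDA versions.

    Returns None when major and minor agree. Flagging any major/minor
    difference is an assumption made for this sketch.
    """
    built, installed = parse_version(compiled), parse_version(runtime)
    if built[:2] == installed[:2]:
        return None
    level = "major" if built[0] != installed[0] else "minor"
    return f"{level} version mismatch: built against {compiled}, image has {runtime}"


if __name__ == "__main__":
    # Versions quoted in this thread: toolchain 12.1.0, image SBOM 12.6.2.
    print(cuda_mismatch("12.1.0", "12.6.2"))
```

A check of this kind could run against the SBOM after each image build, so that a toolchain and image drifting apart shows up before tasks start failing.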
