Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable standard forms of GCP auth for oci sources #815

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

thejosephstevens
Copy link

@thejosephstevens thejosephstevens commented Oct 13, 2024

This is related to #311 . We use k8s-based OIDC for all of our GCP auth, which allows for us to configure workloads in clusters in a simple, templated way to get access to cloud resources without ever having to exchange secret materials. This has some fairly straightforward security benefits, in addition to removing operational concerns of secret distribution, rotation, invalidation, etc...

The updated code piggybacks on the DefaultTokenSource method from the google auth package in order to support all standard methods of obtaining an auth token, in addition to the previously supported method requiring the well-known metadata endpoint to be available (only applicable in GCE and GKE clusters with Workload Identity enabled).

Note - This technically constitutes a breaking change for auth behavior, but it should be very uncommon for it to negatively impact users as they would have had to set up Flux with some form of ambient auth, and leave it in place even though it's not taking effect and they're falling back to metadata endpoint auth. If, for example, someone has a flux workload with GOOGLE_APPLICATION_CREDENTIALS configured, with the existing code that would be ignored, but with this change in place their auth behavior would change. I'm happy to write up full details on what scenarios would be impacted by this for release notes, where would be best for me to put that?

Edit:
I just wanted to note that I was able to create a build of source-controller based off of this pkg/oci revision and it appears to work fine. I'm not seeing any issues introduced, and I was able to pull a GCP Artifact Registry OCI repo with k8s-based OIDC from an EKS cluster.

@stefanprodan
Copy link
Member

Hey @thejosephstevens can you signoff your commits so we can run the tests. See https://github.com/fluxcd/pkg/pull/815/checks?check_run_id=31453839284

@stefanprodan stefanprodan added the area/oci OCI related issues and pull requests label Oct 14, 2024
@thejosephstevens thejosephstevens force-pushed the support-standard-gcp-oci-auth-methods branch from f6308e6 to 1ce9cc4 Compare October 14, 2024 18:44
@thejosephstevens
Copy link
Author

Hey @thejosephstevens can you signoff your commits so we can run the tests. See https://github.com/fluxcd/pkg/pull/815/checks?check_run_id=31453839284

Done!

@thejosephstevens thejosephstevens force-pushed the support-standard-gcp-oci-auth-methods branch from 1ce9cc4 to ded413e Compare October 14, 2024 19:33
@stefanprodan
Copy link
Member

@thejosephstevens please run make tidy and force push.

@thejosephstevens thejosephstevens force-pushed the support-standard-gcp-oci-auth-methods branch from ded413e to cb4a6bc Compare October 14, 2024 20:43
@stefanprodan
Copy link
Member

@darkowlzz can you fork this and run our GCP e2e tests please.

@stefanprodan
Copy link
Member

@thejosephstevens could you please run make test-oci and fix the failing ones.

@thejosephstevens thejosephstevens force-pushed the support-standard-gcp-oci-auth-methods branch from 590d595 to d48b493 Compare October 16, 2024 01:13
@thejosephstevens
Copy link
Author

@thejosephstevens could you please run make test-oci and fix the failing ones.

Just got them all working!

@matheuscscp
Copy link
Contributor

Hi @thejosephstevens, thanks for all the work here! 🏅

One question:

pull a GCP Artifact Registry OCI repo with k8s-based OIDC from an EKS cluster

Here, are you specifically referring to GCP Workload Identity Federation?

@thejosephstevens
Copy link
Author

Hi @thejosephstevens, thanks for all the work here! 🏅

One question:

pull a GCP Artifact Registry OCI repo with k8s-based OIDC from an EKS cluster

Here, are you specifically referring to GCP Workload Identity Federation?

I am! We generate cross-cloud identities for our workloads backed by k8s OIDC through Workload Identity Federation (or equivalent) in each cloud. We just happen to run our core operations primarily in GCP, which is why I jumped in here to contribute.

At a super specific level, we use GOOGLE_APPLICATION_CREDENTIALS to point to an external_account json file (mounted by configmap) which maps to the GCP identity, and attach a service account token with the appropriate audience for that GCP Workload Identity Pool Provider.

@thejosephstevens thejosephstevens force-pushed the support-standard-gcp-oci-auth-methods branch from d48b493 to d1c9d10 Compare October 18, 2024 19:23
@darkowlzz
Copy link
Contributor

darkowlzz commented Oct 22, 2024

Hi, I just tested this using our existing integration test suite and they continue to work for both workload identity and node identity within GCP. Inspecting the effect of the code change, I don't think it'll cause any breaking change for our existing users if they followed our recommended setup from the Flux docs.

Trying to understand the reasoning behind the previous implementation, it was implemented in fluxcd/image-reflector-controller#194, in 2021. It seems GCP announced workload identity federated in 2021 https://cloud.google.com/blog/products/identity-security/enable-keyless-access-to-gcp-with-workload-identity-federation. It's possible that the initial implementation was based on some old sample code from that year.

The auth client code changes look good to me, except for the proxy part, but I'll take some time to do some more testing and research around this before a full review.

Thanks for working on this.

@darkowlzz
Copy link
Contributor

darkowlzz commented Oct 24, 2024

I would like to clarify about my last comment, it felt like I may have mislead in what I looked at and tested. In the dev meeting today, @matheuscscp brought the concerns regarding the proxy configuration and we had a really long discussion about how to keep the proxy support. I believe @matheuscscp will post about it separately in detail. But I'd like to clarify that initially I was more concerned about the mention of breaking change and only tested that the auth continues to work for our existing use cases. I didn't review all the code, just the token exchange flow that the google library does, hence I said that I'd like to test and research more of this before a full review.
Federated identity will be a new use case that we'll add support for in this PR. In the last release, we added per object proxy support which includes this authentication #808. We need to maintain the proxy support for the existing use cases. Ideally, federated identity auth flow should also respect per object proxy.

@matheuscscp
Copy link
Contributor

Hi @thejosephstevens!

So, yes, like @darkowlzz mentioned in the previous comment, this PR is currently removing support for proxy configuration when talking to the internal metadata server of GCE/GKE. The proxyURL option is no longer being passed anywhere. We need to keep proxy support at least for the code path that already exists, as it was introduced in the last Flux release for adding per-object proxy configuration for OCIRepository.

On one hand, I confess that I'd normally be very tempted to accept the current solution of the PR as is. I appreciate very much the simplicity of just calling a single function from the Google library. This has many benefits, including the ease of maintenance, and I believe it's also what most projects would do.

On the other hand, the function DefaultTokenSource does not accept any non-global form of configuring a proxy URL for fetching the token, so if we just use it we lose per-object proxy configuration support.

I think a good compromise would be to copy, modify and maintain the function FindDefaultCredentialsWithParams, which is what DefaultTokenSource ultimately calls.

As the first line of the documentation says, the function FindDefaultCredentialsWithParams searches for Google's "Application Default Credentials", which are the methods used by the Google libraries for retrieving authentication from the environment of the process. Those methods include both the one we currently have, which is talking to the GCE/GKE internal metadata server, and the one which reads a JSON configuration file located either at os.Getenv("GOOGLE_APPLICATION_CREDENTIALS") or wellKnownFile(). The JSON configuration file method is the one that covers Workload Identity Federation, and also JSON Service Account Keys, along a few other methods.

The function FindDefaultCredentialsWithParams is simple and short enough for us to copy, modify and maintain. We would just need to replace their implementation for talking to the internal metadata server with ours, which supports proxy configuration. This would unfortunately change the interface of this fuction to optionally accept the proxy URL, but that's the whole point of making this choice and not a big deal. I'd also be fine copying the other small, unexported functions called by this one, as they are all very simple and short as well, but I'd be fine with inlining them as well. To simplify even further, I think it would also be okay to change the return type from *google.Credentials to oauth2.TokenSource. The latter is the interface that we will need to implement using our current code (that fetches the token from the internal metadata server) in order to conform with the JSON configuration file code that we will leverage from the library, which will be handed to us also through this interface.

I'd like to note that some of the JSON configurations may still require talking to external HTTP endpoints. Full support for per-object proxy configuration would mean supporting those HTTP calls as well. Workload Identity Federation is probably one of those, as the OIDC ServiceAccount Token issued by Kubernetes must still be exchanged for a Google Access Token, which is done by talking to a Google HTTP API. It would be harder to support proxy for that, as it would require us to copy and adapt a lot more code from the Google library. But, as I mentioned previously, I'd be fine introducing support for new use cases without removing proxy support for the code path that we already have. And we probably need an issue to track proxy support for the remaining code paths.

@thejosephstevens
Copy link
Author

Hmmmm, that seems like a lot of code to copy and maintain that's currently just handled by Google. My initial reaction is it seems like if the requirement is to maintain per-object proxy support for requesting metadata-service tokens, it seems like it would be simplest to just split logic on the presence of a proxyUrl override (which could be served by the existing logic), and have all remaining calls fall through the Google-maintained defaultTokenSource path.

I may be missing some context here though, I know that Flux supports auth strategies defined at the object level, but I'm not actually aware of when that would be used. From my experience I would expect:

  • For teams where Flux is operated by one group and consumed by others, I would expect objects to either pass in their own creds, or leverage the existing "ambient" access that Flux has (set up by SRE/Infra/whoever)
  • For teams where Flux and the objects are operated by the same group I would expect similar patterns (but more preference for "ambient" auth)

I understand global proxy requirements (something-something network complexities), but in order to need per-object proxy support I think that would require consumers to be effectively shoving their own identities into Flux for auth by way of proxies, but as it stands today the metadata URL is hardcoded in Flux, so I just don't know how that would even function to produce different auth resolutions. From the base issue here it looks like maybe this is focused on consistency across providers and the inherent per-object auth model that Flux uses, but I don't know if I'm missing use cases from the community. I'd be interested if there's context you could point me to (or raise) that I'm missing.

I don't have a strong opinion on implementation details, but I think if the consensus is to head down the route of copying chunks of the Google auth providers and working them in I may ask if one of you could pick up the remainder of the PR, as I just hadn't earmarked that much more time for this work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/oci OCI related issues and pull requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants