Promise will never complete when instrumenting worker #150

Open
jmaroeder opened this issue Jul 20, 2024 · 6 comments · May be fixed by #182

Comments

@jmaroeder

We have a pretty complex Cloudflare Worker that we recently added instrumentation to with otel-cf-workers. The library has worked like a dream, providing extra insight into our requests. However, when we enabled the library, we noticed an uptick in errors from Cloudflare with the error message "Promise will never complete". This seems to happen on about 0.01% of requests (with a sampling ratio of 0.01, or 1%).

Has anyone else run into this? When we disable instrumentation, the error goes away. Unfortunately, Cloudflare's documentation about this error is pretty scant.

@evanderkoogh
Owner

Hey @jmaroeder, the cause of the error is almost always that you end up awaiting a Promise that was created in another request.
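
To make that concrete, the classic shape of the problem in a Worker looks roughly like the hypothetical snippet below (the CONFIG_URL variable is made up, and this is not code from this library):

```ts
// Hypothetical anti-pattern, not code from otel-cf-workers: a promise created
// during one request is cached at module scope and awaited by a later request.
// Each request runs in its own I/O context, so the later await can surface
// errors like "Promise will never complete".
let cachedConfig: Promise<unknown> | undefined

export default {
  async fetch(_request: Request, env: { CONFIG_URL: string }): Promise<Response> {
    // The first request creates the promise; subsequent requests await it.
    cachedConfig ??= fetch(env.CONFIG_URL).then((res) => res.json())
    const config = await cachedConfig
    return new Response(JSON.stringify(config))
  },
}
```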

I am not aware of any of those situations in this library (which obviously doesn't mean that there aren't any...)

Do you have any more information about what these requests have in common? Are they requests that are cancelled by the client? Are they to a particular URL? Do those URLs use a particular type of instrumentation (especially something like cache)?

@jmaroeder
Author

It is possible that the requests are being cancelled by the client - we have a large number of those.

These do not appear to be limited to a specific URL or pattern of URL.

As far as instrumentation goes:

  • Nearly all of our requests instrument one or more fetch GET calls
  • A small portion (<5%) instrument fetch POST calls
  • Approximately half of our requests instrument kv GET calls
  • A minuscule portion (0.0003%) instrument cache GET and cache PUT calls
  • Our worker only responds to handler.fetch events, and it also makes extensive use of ctx.waitUntil (a rough sketch of this shape is below)
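
For reference, here is a heavily simplified sketch of the shape described above, assuming the library's instrument wrapper; the binding names, URLs, and service name are placeholders rather than our actual code:

```ts
import { instrument, type ResolveConfigFn } from '@microlabs/otel-cf-workers'

interface Env {
  MY_KV: KVNamespace       // placeholder KV binding name
  OTLP_ENDPOINT: string    // placeholder exporter URL variable
}

const handler: ExportedHandler<Env> = {
  async fetch(_request, env, ctx) {
    // Most requests: an upstream fetch GET plus (roughly half the time) a kv GET.
    const [upstream, kvValue] = await Promise.all([
      fetch('https://upstream.example.com/api'), // placeholder URL
      env.MY_KV.get('some-key'),
    ])

    // Background work deferred past the response via ctx.waitUntil.
    ctx.waitUntil(fetch('https://upstream.example.com/log', { method: 'POST' }))

    return new Response(await upstream.text(), {
      headers: { 'x-kv-hit': kvValue ? '1' : '0' },
    })
  },
}

const config: ResolveConfigFn = (env: Env) => ({
  exporter: { url: env.OTLP_ENDPOINT },
  service: { name: 'example-worker' }, // placeholder service name
})

export default instrument(handler, config)
```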

One other thing worth noting - it may have to do with the volume of requests. We have two different deployments of our worker running identical code, but with significant differences in the number of requests handled. Both workers handle multiple domains.

One deployment sees about 1,500 requests per second under typical load, and this particular worker only had an error rate of less than 0.0002%.

The other deployment sees about 15,000 requests per second under typical load, and this is the one that saw error rates of 0.01%. So perhaps the issue only shows up with extremely high request volume?

@evanderkoogh
Owner

Thanks! That gives me at least some direction to look into things. It is possible that there is an interaction between the flushing of the exporter and ctx.waitUntil that you can run into under extremely high load.

This week is extremely packed, but hopefully I'll have some time next week to have a proper look at things.

@evanderkoogh
Owner

Right... sorry it took so long to find the cause of this; as you can imagine, it is quite tricky. But I think I found it while doing a thorough rethink of the way the Tracer/SpanProcessor and Exporter work. I will take this on as part of said rethink and fix it.

@evanderkoogh
Owner

Ok... more analysis has been done, and I now know exactly where it is going wrong; it will indeed be taken care of in the upcoming logic rewrite. In the meantime, it is good to know that there is no impact on your users: the error happens after both responses have been returned to their clients, while one trace is being sent and another trace export is still in flight.

I am pretty sure that both traces are still exported successfully, but even if they aren't, at this error percentage it won't have a meaningful impact anyway. Thanks for the report; I greatly appreciate being able to fix this bug, even if it is only rarely triggered.
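
For anyone following along, the failure mode is roughly of the shape sketched below. This is a simplified, hypothetical illustration (the collector URL and helper names are made up), not the actual exporter code:

```ts
// Hypothetical sketch of the failure mode described above; NOT the actual
// otel-cf-workers exporter. A module-scoped in-flight export promise from
// request A's trace is still pending when request B finishes and flushes its
// own trace. If request B ends up awaiting (via ctx.waitUntil) a promise that
// was created in request A's context, the runtime can report
// "Promise will never complete" after both responses have already been sent.
let exportInFlight: Promise<void> | undefined

async function exportTrace(spansPayload: BodyInit): Promise<void> {
  // Wait for the previous export to finish before starting the next one.
  // That previous promise may belong to a different request's I/O context.
  if (exportInFlight) await exportInFlight

  exportInFlight = fetch('https://collector.example.com/v1/traces', {
    method: 'POST',
    body: spansPayload,
  }).then(() => undefined)

  try {
    await exportInFlight
  } finally {
    exportInFlight = undefined
  }
}

export default {
  async fetch(_req: Request, _env: unknown, ctx: ExecutionContext): Promise<Response> {
    // The export is deferred past the response, so the error only shows up
    // after the client has already been answered.
    ctx.waitUntil(exportTrace(JSON.stringify({ spans: [] })))
    return new Response('ok')
  },
}
```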

@jmaroeder
Author

Thank you for looking into it further!

evanderkoogh linked a pull request on Dec 2, 2024 that will close this issue