Promise will never complete when instrumenting worker #150

Open
jmaroeder opened this issue Jul 20, 2024 · 6 comments · May be fixed by #182

Comments

@jmaroeder

We have a pretty complex Cloudflare Worker that we recently added instrumentation to with otel-cf-workers. The library has worked like a dream, providing extra insight into our requests. However, when we enabled the library, we noticed an uptick in errors from Cloudflare with the error message "Promise will never complete". This seems to happen on about 0.01% of requests (with a sampling ratio of 0.01, or 1%).

Has anyone else run into this? When we disable instrumentation, the error goes away. Unfortunately, Cloudflare's documentation about this error is pretty scant.

@evanderkoogh
Owner

Hey @jmaroeder, the cause of the error is almost always that you end up awaiting a Promise that was created in another request.
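
To make that concrete, the classic shape of the problem in a Worker looks roughly like the hypothetical snippet below (the CONFIG_URL variable is made up, and this is not code from this library):

```ts
// Hypothetical anti-pattern, not code from otel-cf-workers: a promise created
// during one request is cached at module scope and awaited by a later request.
// Each request runs in its own I/O context, so the later await can surface
// errors like "Promise will never complete".
let cachedConfig: Promise<unknown> | undefined

export default {
  async fetch(_request: Request, env: { CONFIG_URL: string }): Promise<Response> {
    // The first request creates the promise; subsequent requests await it.
    cachedConfig ??= fetch(env.CONFIG_URL).then((res) => res.json())
    const config = await cachedConfig
    return new Response(JSON.stringify(config))
  },
}
```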

I am not aware of any of those situations in this library (which obviously doesn't mean that there aren't any...)

Do you have any more information about what these requests have in common? Are they requests that are cancelled by the client? Are they to a particular URL? Do those URLs use a particular type of instrumentation (especially something like cache)?

@jmaroeder
Author

It is possible that the requests are being cancelled by the client - we have a large number of those.

These do not appear to be limited to a specific URL or pattern of URL.

As far as instrumentation goes:

  • Nearly all of our requests instrument one or more fetch GET calls
  • A small portion (<5%) instrument fetch POST calls
  • Approximately half of our requests instrument kv GET calls
  • A minuscule portion (0.0003%) instrument cache GET and cache PUT calls
  • Our worker only responds to handler.fetch events, and it also makes extensive use of ctx.waitUntil (a rough sketch of this shape is below)
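
For reference, here is a heavily simplified sketch of the shape described above, assuming the library's instrument wrapper; the binding names, URLs, and service name are placeholders rather than our actual code:

```ts
import { instrument, type ResolveConfigFn } from '@microlabs/otel-cf-workers'

interface Env {
  MY_KV: KVNamespace       // placeholder KV binding name
  OTLP_ENDPOINT: string    // placeholder exporter URL variable
}

const handler: ExportedHandler<Env> = {
  async fetch(_request, env, ctx) {
    // Most requests: an upstream fetch GET plus (roughly half the time) a kv GET.
    const [upstream, kvValue] = await Promise.all([
      fetch('https://upstream.example.com/api'), // placeholder URL
      env.MY_KV.get('some-key'),
    ])

    // Background work deferred past the response via ctx.waitUntil.
    ctx.waitUntil(fetch('https://upstream.example.com/log', { method: 'POST' }))

    return new Response(await upstream.text(), {
      headers: { 'x-kv-hit': kvValue ? '1' : '0' },
    })
  },
}

const config: ResolveConfigFn = (env: Env) => ({
  exporter: { url: env.OTLP_ENDPOINT },
  service: { name: 'example-worker' }, // placeholder service name
})

export default instrument(handler, config)
```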

One other thing worth noting - it may have to do with the volume of requests. We have two different deployments of our worker running identical code, but with significant differences in the number of requests handled. Both workers handle multiple domains.

One deployment sees about 1,500 requests per second under typical load, and this particular worker only had an error rate of less than 0.0002%.

The other deployment sees about 15,000 requests per second under typical load, and this is the one that saw error rates of 0.01%. So perhaps the issue only shows up with extremely high request volume?

@evanderkoogh
Owner

Thanks! That gives me at least some direction to look into things. It is possible that there is an interaction between the flushing of the exporter and ctx.waitUntil that you can run into under extremely high load.

This week is extremely packed, but hopefully I'll have some time next week to have a proper look at things.

@evanderkoogh
Owner

Right... sorry it took so long to find the cause of this; as you can imagine, it is quite tricky. But I think I found it while doing a thorough rethink of the way the Tracer/SpanProcessor and Exporter work. I will take this on as part of said rethink and fix it.

@evanderkoogh
Owner

Ok... more analysis has been done, and I now know exactly where it is going wrong; it will indeed be taken care of in the upcoming logic rewrite. In the meantime, it is good to know that there is no impact on your users: the error happens after both responses have been returned to their clients, while one trace is being sent and another trace export is still in flight.

I am pretty sure that both traces are still exported successfully, but even if they aren't, at this error percentage it won't have a meaningful impact anyway. Thanks for the report; I greatly appreciate being able to fix this bug, even if it is only rarely triggered.
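
For anyone following along, the failure mode is roughly of the shape sketched below. This is a simplified, hypothetical illustration (the collector URL and helper names are made up), not the actual exporter code:

```ts
// Hypothetical sketch of the failure mode described above; NOT the actual
// otel-cf-workers exporter. A module-scoped in-flight export promise from
// request A's trace is still pending when request B finishes and flushes its
// own trace. If request B ends up awaiting (via ctx.waitUntil) a promise that
// was created in request A's context, the runtime can report
// "Promise will never complete" after both responses have already been sent.
let exportInFlight: Promise<void> | undefined

async function exportTrace(spansPayload: BodyInit): Promise<void> {
  // Wait for the previous export to finish before starting the next one.
  // That previous promise may belong to a different request's I/O context.
  if (exportInFlight) await exportInFlight

  exportInFlight = fetch('https://collector.example.com/v1/traces', {
    method: 'POST',
    body: spansPayload,
  }).then(() => undefined)

  try {
    await exportInFlight
  } finally {
    exportInFlight = undefined
  }
}

export default {
  async fetch(_req: Request, _env: unknown, ctx: ExecutionContext): Promise<Response> {
    // The export is deferred past the response, so the error only shows up
    // after the client has already been answered.
    ctx.waitUntil(exportTrace(JSON.stringify({ spans: [] })))
    return new Response('ok')
  },
}
```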

@jmaroeder
Author

Thank you for looking into it further!

evanderkoogh linked a pull request on Dec 2, 2024 that will close this issue