Replies: 3 comments 3 replies
-
There are also two additional problems I met when processing ...
Both problems could be easily solved in the context of the ... To solve the first problem, the ...
```python
if not requeue and (is_worker_lost or not return_ok):
    # only mark as failure if task has not been requeued
    self.task.backend.mark_as_failure(
        self.id, exc, request=self._context,
        store_result=self.store_errors,
    )

    signals.task_failure.send(sender=self.task, task_id=self.id,
                              exception=exc, args=self.args,
                              kwargs=self.kwargs,
                              traceback=exc_info.traceback,
                              einfo=exc_info)
```
... It would also be useful to handle exceptions that can be raised by the ... To solve the second problem, before sending the ...

```python
self.task.request.update(self._request_dict)
```

...
-
@kwaszczuk what did you end up doing? I'm having the same difficulties running on k8s, where pods can be evicted, killed, OOMed, and so on. My tasks get requeued indefinitely and I have no way to really control it properly.
-
In any case, if you think there could be some improvement in Celery, feel free to come up with a draft PR.
-
In my current team, we have encountered difficulties when gracefully handling WorkerLost errors (especially OOMs) while avoiding infinite requeues. What I would like to see in Celery is the capability to requeue a task after a WorkerLost error only a limited number of times.
Currently, Celery provides us with the `task_reject_on_worker_lost` setting, which always calls `reject(requeue=True)` under the hood. This means it delegates task retrying directly to the broker. However, most (if not all; I've checked RabbitMQ, Redis, and SQS) brokers do not support a retry-counting mechanism when the `requeue=True` option is used. Therefore, without a workaround, Celery will indefinitely requeue tasks that keep causing WorkerLost errors.
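For reference, this is roughly the configuration that exhibits the behaviour described above; the app name and broker URL are just placeholders:

```python
from celery import Celery

app = Celery('tasks', broker='amqp://localhost')  # placeholder broker URL

# With late acks plus reject_on_worker_lost, a task whose worker dies
# mid-execution (e.g. gets OOM-killed) is reject(requeue=True)-ed and the
# broker re-delivers it without any retry counter, so a task that reliably
# OOMs is requeued forever.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True
```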
One way to work around this problem is to not use `task_reject_on_worker_lost` and instead implement a custom retry mechanism based on RabbitMQ DLX (dead-letter exchanges). However, this solution requires the `task_acks_on_failure_or_timeout=False` setting, which significantly disrupts Celery's capabilities in terms of general error handling.
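Roughly, that workaround looks like the sketch below. The queue and exchange names and the 30-second delay are placeholders I picked for illustration, and `task_reject_on_worker_lost` is left at its default:

```python
from kombu import Exchange, Queue

app.conf.task_acks_late = True
app.conf.task_acks_on_failure_or_timeout = False  # reject instead of ack on failure
app.conf.task_default_queue = 'work'

app.conf.task_queues = [
    # Main work queue: rejected (non-requeued) messages are dead-lettered
    # into the retry queue.
    Queue(
        'work',
        Exchange('work'),
        routing_key='work',
        queue_arguments={
            'x-dead-letter-exchange': 'work.retry',
            'x-dead-letter-routing-key': 'work.retry',
        },
    ),
    # Retry queue: no consumers; once the TTL expires, messages are
    # dead-lettered back into the work queue. RabbitMQ records each cycle
    # in the x-death header, which is what a retry cap can be based on.
    Queue(
        'work.retry',
        Exchange('work.retry'),
        routing_key='work.retry',
        queue_arguments={
            'x-message-ttl': 30000,  # wait 30 s before re-delivering
            'x-dead-letter-exchange': 'work',
            'x-dead-letter-routing-key': 'work',
        },
    ),
]
```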
I would be happy to work on improving the current state of things in Celery in this matter. As of now, I have come up with two possible approaches (A and B) for improving Celery's WorkerLost handling:
A. Use Celery's retry mechanism (`app.Task.retry()`) instead of `reject(requeue=True)`.

Pros:

- The number of requeues can be limited via `Task.max_retries`, or manually by accessing `app.Task.request.retries`.

Cons:

- WorkerLost requeues become tied to the `Task.max_retries` limit.

B. Introduce two additional Celery settings: `task_requeue_on_worker_lost` and `task_worker_lost_is_failure`. `task_requeue_on_worker_lost` would allow us to select whether `reject()` should be called with `requeue=False` or `requeue=True`, while `task_worker_lost_is_failure` would define whether a task should be reported as failed if it wasn't requeued. With these two settings available, it would be possible to implement WorkerLost retries with DLX without affecting the lifecycle of non-WorkerLost exceptions (see the configuration sketch below).

Pros:

Cons:
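To make option B concrete, usage might look like the following. Both settings are only proposed here and do not exist in Celery today; they would be combined with a DLX topology like the one sketched above:

```python
# Hypothetical: neither setting exists in Celery yet; the names come from
# proposal B above.
app.conf.task_acks_late = True
app.conf.task_acks_on_failure_or_timeout = False

# Call reject(requeue=False) on WorkerLost, so the message is dead-lettered
# into a retry queue instead of being requeued indefinitely.
app.conf.task_requeue_on_worker_lost = False

# Don't mark the task as FAILED in the result backend while the broker-side
# retry machinery may still re-deliver it.
app.conf.task_worker_lost_is_failure = False
```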
I am quite sure that meaningful changes for both the above solutions would be limited to the following part of the Celery codebase: https://github.com/celery/celery/blob/v5.3.6/celery/worker/request.py#L600-L640
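To illustrate approach A, a rough and untested sketch of a change around that spot in `Request.on_failure` could look like this. It reuses the names visible in the snippet quoted earlier, and re-publishing via `apply_async` with an incremented `retries` value is just one possible mechanism, not a settled design:

```python
# Sketch only: bound WorkerLost requeues with Celery's own retry counter
# instead of delegating to the broker via reject(requeue=True).
if is_worker_lost and self.task.acks_late and self.task.reject_on_worker_lost:
    retries = self._request_dict.get('retries', 0)
    max_retries = self.task.max_retries
    if max_retries is None or retries < max_retries:
        # Re-publish the task with an incremented retry count so that
        # Task.max_retries also bounds WorkerLost requeues.
        self.task.apply_async(
            args=self.args, kwargs=self.kwargs,
            task_id=self.id, retries=retries + 1,
        )
        self.acknowledge()  # ack the original delivery; the copy above replaces it
        return
    # Retry budget exhausted: fall through to the existing failure handling
    # (mark_as_failure + task_failure signal).
```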
Personally, I prefer approach A. However, I am concerned about its backward-compatibility implications. Considering the `task_reject_on_worker_lost` option was introduced over 9 years ago, there's a high chance many Celery deployments implicitly depend on its current behavior. Perhaps we could address this by making the new implementation opt-in while slowly deprecating the original one.

I would love to get feedback about the suggested solutions so that we can agree on the best course of action. I am willing to provide pseudo-code implementations for both solutions if needed.