Replies: 3 comments 3 replies
-
There are also two additional problems I met when processing ...
Both problems could be easily solved in the context of the ... To solve the first problem, the ...
```python
if not requeue and (is_worker_lost or not return_ok):
    # only mark as failure if task has not been requeued
    self.task.backend.mark_as_failure(
        self.id, exc, request=self._context,
        store_result=self.store_errors,
    )

    signals.task_failure.send(sender=self.task, task_id=self.id,
                              exception=exc, args=self.args,
                              kwargs=self.kwargs,
                              traceback=exc_info.traceback,
                              einfo=exc_info)
```
... It would also be useful to handle exceptions that can be raised by the ... To solve the second problem, before sending the ...

```python
self.task.request.update(self._request_dict)
```

...
-
@kwaszczuk what did you end up doing? I'm having the same difficulties running on k8s, where pods can be evicted, killed, OOMed, and so on. My tasks get requeued indefinitely and I have no way to really control it properly.
-
In any case, if you think there could be some improvement in Celery, feel free to come up with a draft PR.
-
In my current team, we have encountered difficulties when gracefully handling WorkerLost errors (especially OOMs) while avoiding infinite requeues. What I would like to see in Celery is the capability to requeue a task after a WorkerLost error only a limited number of times.
Currently, Celery provides us with the `task_reject_on_worker_lost` setting, which always calls `reject(requeue=True)` under the hood. This means it delegates task retrying directly to the broker. However, most (if not all; I've checked RabbitMQ, Redis, and SQS) brokers do not support a retry-counting mechanism when the `requeue=True` option is used. Therefore, without a workaround, Celery will indefinitely requeue tasks that keep causing WorkerLost errors.
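For reference, this is roughly the configuration that exhibits the behaviour described above; the app name and broker URL are just placeholders:

```python
from celery import Celery

app = Celery('tasks', broker='amqp://localhost')  # placeholder broker URL

# With late acks plus reject_on_worker_lost, a task whose worker dies
# mid-execution (e.g. gets OOM-killed) is reject(requeue=True)-ed and the
# broker re-delivers it without any retry counter, so a task that reliably
# OOMs is requeued forever.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True
```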
One way to work around this problem is to not use `task_reject_on_worker_lost` and instead implement a custom retry mechanism based on RabbitMQ DLX (dead-letter exchanges). However, this solution requires the `task_acks_on_failure_or_timeout=False` setting, which significantly disrupts Celery's capabilities in terms of general error handling.
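Roughly, that workaround looks like the sketch below. The queue and exchange names and the 30-second delay are placeholders I picked for illustration, and `task_reject_on_worker_lost` is left at its default:

```python
from kombu import Exchange, Queue

app.conf.task_acks_late = True
app.conf.task_acks_on_failure_or_timeout = False  # reject instead of ack on failure
app.conf.task_default_queue = 'work'

app.conf.task_queues = [
    # Main work queue: rejected (non-requeued) messages are dead-lettered
    # into the retry queue.
    Queue(
        'work',
        Exchange('work'),
        routing_key='work',
        queue_arguments={
            'x-dead-letter-exchange': 'work.retry',
            'x-dead-letter-routing-key': 'work.retry',
        },
    ),
    # Retry queue: no consumers; once the TTL expires, messages are
    # dead-lettered back into the work queue. RabbitMQ records each cycle
    # in the x-death header, which is what a retry cap can be based on.
    Queue(
        'work.retry',
        Exchange('work.retry'),
        routing_key='work.retry',
        queue_arguments={
            'x-message-ttl': 30000,  # wait 30 s before re-delivering
            'x-dead-letter-exchange': 'work',
            'x-dead-letter-routing-key': 'work',
        },
    ),
]
```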
I would be happy to work on improving the current state of things in Celery in this matter. As of now, I have come up with two possible approaches (A and B) for improving Celery's WorkerLost handling:
A. Use Celery's retry mechanism (`app.Task.retry()`) instead of `reject(requeue=True)`.

Pros:

- The number of requeues can be limited via `Task.max_retries`, or manually by accessing `app.Task.request.retries`.

Cons:

- WorkerLost requeues become tied to the `Task.max_retries` limit.

B. Introduce two additional Celery settings: `task_requeue_on_worker_lost` and `task_worker_lost_is_failure`. `task_requeue_on_worker_lost` would allow us to select whether `reject()` should be called with `requeue=False` or `requeue=True`, while `task_worker_lost_is_failure` would define whether a task should be reported as failed if it wasn't requeued. With these two settings available, it would be possible to implement WorkerLost retries with DLX without affecting the lifecycle of non-WorkerLost exceptions (see the configuration sketch below).

Pros:

Cons:
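To make option B concrete, usage might look like the following. Both settings are only proposed here and do not exist in Celery today; they would be combined with a DLX topology like the one sketched above:

```python
# Hypothetical: neither setting exists in Celery yet; the names come from
# proposal B above.
app.conf.task_acks_late = True
app.conf.task_acks_on_failure_or_timeout = False

# Call reject(requeue=False) on WorkerLost, so the message is dead-lettered
# into a retry queue instead of being requeued indefinitely.
app.conf.task_requeue_on_worker_lost = False

# Don't mark the task as FAILED in the result backend while the broker-side
# retry machinery may still re-deliver it.
app.conf.task_worker_lost_is_failure = False
```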
I am quite sure that meaningful changes for both the above solutions would be limited to the following part of the Celery codebase: https://github.com/celery/celery/blob/v5.3.6/celery/worker/request.py#L600-L640
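To illustrate approach A, a rough and untested sketch of a change around that spot in `Request.on_failure` could look like this. It reuses the names visible in the snippet quoted earlier, and re-publishing via `apply_async` with an incremented `retries` value is just one possible mechanism, not a settled design:

```python
# Sketch only: bound WorkerLost requeues with Celery's own retry counter
# instead of delegating to the broker via reject(requeue=True).
if is_worker_lost and self.task.acks_late and self.task.reject_on_worker_lost:
    retries = self._request_dict.get('retries', 0)
    max_retries = self.task.max_retries
    if max_retries is None or retries < max_retries:
        # Re-publish the task with an incremented retry count so that
        # Task.max_retries also bounds WorkerLost requeues.
        self.task.apply_async(
            args=self.args, kwargs=self.kwargs,
            task_id=self.id, retries=retries + 1,
        )
        self.acknowledge()  # ack the original delivery; the copy above replaces it
        return
    # Retry budget exhausted: fall through to the existing failure handling
    # (mark_as_failure + task_failure signal).
```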
Personally, I prefer approach A. However, I am concerned about its backward-compatibility implications. Considering the `task_reject_on_worker_lost` option was introduced over 9 years ago, there's a high chance many Celery deployments implicitly depend on its current behavior. Perhaps we could address this by making the new implementation opt-in while slowly deprecating the original one.

I would love to get feedback about the suggested solutions so that we can agree on the best course of action. I am willing to provide pseudo-code implementations for both solutions if needed.