Hello folks!

Suppose we have a Flyte workflow with a failed map_task. At the start, this task submits a lot of pods (300-500) to Kubernetes, and one of them dies, so the whole workflow fails. But the remaining pods are still in Kubernetes, waiting to be executed.

Why doesn't Flyte kill the remaining pods when one of them fails? To be able to rerun this workflow, all these pods have to be manually deleted from the Kubernetes cluster.
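
For reference, the manual cleanup can be sketched roughly as below. This assumes the leftover pods carry an `execution-id` label and that you know their namespace; both are assumptions about the cluster setup, so check with `kubectl get pods --show-labels` first. The helper name `build_cleanup_command`, the namespace, and the execution id are hypothetical. The helper only builds the `kubectl` command and does not run it.

```python
import shlex


def build_cleanup_command(namespace: str, execution_id: str, dry_run: bool = True) -> list[str]:
    """Build a kubectl command that deletes all pods of one execution.

    Assumption: the pods are labeled with `execution-id=<id>`.
    """
    cmd = [
        "kubectl", "delete", "pods",
        "--namespace", namespace,
        "--selector", f"execution-id={execution_id}",
    ]
    if dry_run:
        # Client-side dry run: reports what would be deleted without deleting.
        cmd.append("--dry-run=client")
    return cmd


if __name__ == "__main__":
    # Hypothetical namespace and execution id, for illustration only.
    print(shlex.join(build_cleanup_command("myproject-development", "f1a2b3c4d5")))
```

Keeping `dry_run=True` as the default makes it harder to delete pods from the wrong execution by accident; pass `dry_run=False` only after checking the dry-run output.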

<@U04PVFWGL9J> the functionality you described is 100% what is expected. So when a Flyte task fails, propeller attempts to abort other running tasks so that Pods may be cleaned up, etc.

I suspect what is happening is similar to <https://github.com/flyteorg/flyte/issues/3239|this issue>. Basically, during the cleanup phase (Flyte aborting running tasks), if a task has already been marked as failed then <https://github.com/flyteorg/flytepropeller/blob/752f55e9c7f7e357de707aa3eaae6e3af03b186f/pkg/controller/nodes/task/handler.go#L795-L798|propeller skips the abort>. Since map tasks are built as a plugin, when one subtask fails it marks the whole task as failed, and the abort fails to propagate down to the other running subtasks.
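
To illustrate the suspected behavior, here is a toy model in Python. It is not propeller's actual code (propeller is written in Go), and the names `ToyMapTask`, `skip_if_terminal`, and `leftover_running` are made up for this sketch. It only shows how skipping the abort for an already-failed task would leave the other subtasks running.

```python
from dataclasses import dataclass, field

RUNNING, FAILED, ABORTED = "RUNNING", "FAILED", "ABORTED"


@dataclass
class ToyMapTask:
    """Toy stand-in for a map task and its subtask pods (not propeller's real types)."""
    subtasks: list = field(default_factory=list)
    phase: str = RUNNING

    def on_subtask_failure(self, index: int) -> None:
        # One failed subtask marks the whole map task as failed.
        self.subtasks[index] = FAILED
        self.phase = FAILED

    def abort(self, skip_if_terminal: bool) -> None:
        # With skip_if_terminal=True the abort returns early for an
        # already-failed task, so running subtasks are never touched.
        if skip_if_terminal and self.phase == FAILED:
            return
        self.subtasks = [ABORTED if s == RUNNING else s for s in self.subtasks]


def leftover_running(skip_if_terminal: bool, n: int = 5) -> int:
    """Fail subtask 0 of n, abort the task, and count subtasks still running."""
    task = ToyMapTask(subtasks=[RUNNING] * n)
    task.on_subtask_failure(0)
    task.abort(skip_if_terminal)
    return task.subtasks.count(RUNNING)


if __name__ == "__main__":
    print(leftover_running(skip_if_terminal=True))   # 4 subtasks left running
    print(leftover_running(skip_if_terminal=False))  # 0 subtasks left running
```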

Do you mind filing a bug and linking the above issue? This is something I plan on addressing in the next week or so. [flyte-bug]

:ladybug: Create a new Flyte Bug issue: <https://github.com/flyteorg/flyte/issues/new?assignees=&labels=bug%2Cuntriaged&template=bug_report.yaml&title=%5BBUG%5D+>