Hey we have a workflow where one of the task Pods are not st Flyte #flyte-support


helpful-crowd-74546

10/28/2022, 12:07 PM

Hey, we have a workflow where one of the task Pods is not starting. It started happening after we adjusted resource requests, but I can’t see that the Pod is being scheduled. If no worker node matched the requested resources, we should see the Pod in Pending state, waiting for a worker node to be scaled up. Tailing all logs from the flyte namespaces and grepping for the task name yields the following logs (in comment).

helpful-crowd-74546

10/28/2022, 12:08 PM


~ stern -i evaluate_model .
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","node":"flyteworkflowstrainandevaluateevaluatemodel","ns":"defect-detection-development","res_ver":"437166569","routine":"worker-16","src":"container_helper.go:283","tasktype":"python-task","wf":"defect-detection:development:.flytegen.flyte.workflows.train_and_evaluate.evaluate_model"},"level":"info","msg":"Adjusted container resources [{map[cpu:{{1 0} {\u003cnil\u003e} 1 DecimalSI} memory:{{10737418240 0} {\u003cnil\u003e} 10Gi BinarySI} nvidia.com/gpu:{{1 0} {\u003cnil\u003e} 1 DecimalSI}] map[cpu:{{1 0} {\u003cnil\u003e} 1 DecimalSI} memory:{{10737418240 0} {\u003cnil\u003e} 10Gi BinarySI} nvidia.com/gpu:{{1 0} {\u003cnil\u003e} 1 DecimalSI}]}]","ts":"2022-10-28T12:03:43Z"}
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","node":"flyteworkflowstrainandevaluateevaluatemodel","ns":"defect-detection-development","res_ver":"437166569","routine":"worker-16","src":"plugin_manager.go:207","tasktype":"python-task","wf":"defect-detection:development:.flytegen.flyte.workflows.train_and_evaluate.evaluate_model"},"level":"info","msg":"Creating Object: Type:[/v1, Kind=pod], Object:[defect-detection-development/akmwln5vsvgk2tc4rbj6-fs3r6kdy-0]","ts":"2022-10-28T12:03:43Z"}
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","node":"flyteworkflowstrainandevaluateevaluatemodel","ns":"defect-detection-development","res_ver":"437166569","routine":"worker-16","src":"plugin_manager.go:181","tasktype":"python-task","wf":"defect-detection:development:.flytegen.flyte.workflows.train_and_evaluate.evaluate_model"},"level":"info","msg":"The resource requirement for creating Pod [defect-detection-development/akmwln5vsvgk2tc4rbj6-fs3r6kdy-0] is [[{[cpu]: [1]} {[memory]: [10Gi]} {[nvidia.com/gpu]: [1]}]]\n","ts":"2022-10-28T12:03:43Z"}
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","node":"flyteworkflowstrainandevaluateevaluatemodel","ns":"defect-detection-development","res_ver":"437166569","routine":"worker-16","src":"controller.go:38","tasktype":"python-task","wf":"defect-detection:development:.flytegen.flyte.workflows.train_and_evaluate.evaluate_model"},"level":"info","msg":"The back-off handler for [/v1, Kind=pod,defect-detection-development] has been loaded.\n","ts":"2022-10-28T12:03:43Z"}
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"abrgvfvdnwh2jrgfr682","node":"n6","ns":"defect-detection-development","res_ver":"437079416","routine":"worker-23","src":"pre_post_execution.go:114","tasktype":"python-task","wf":"defect-detection:development:flyte.workflows.train_and_evaluate.wf"},"level":"info","msg":"Catalog CacheSerializeDisabled: for Task [defect-detection/development/flyte.workflows.train_and_evaluate.evaluate_model/bb7087e]","ts":"2022-10-28T12:03:43Z"}

hallowed-mouse-14616

10/28/2022, 1:29 PM

@helpful-crowd-74546 thanks for the logs, they really help with debugging! Here is what I think is happening: the log line "The back-off handler for [/v1, Kind=pod,defect-detection-development] has been loaded" tells us that you have the back-off handler enabled in FlytePropeller. This is a construct in Flyte designed to reduce the number of API server calls. In this instance, we track the resource limits for a k8s namespace; if scheduling a Pod would exceed those limits, we back off and try again later. Can you confirm this by searching for the log message "The operation was blocked due to back-off", which should correspond to the same workflow execution id (i.e. akmwln5vsvgk2tc4rbj6)?
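For anyone wanting to script that check, here is a minimal Python sketch that filters stern-style output for the back-off message and a given execution id. It assumes each line looks like the ones pasted above (pod name, container name, then a JSON object with "json.exec_id", "msg" and "ts" fields) and that the back-off message carries the same fields, which this thread does not confirm. The function names are hypothetical and not part of Flyte or stern.

```python
import json

# Message text quoted in the thread; matched as a substring.
BACKOFF_MSG = "The operation was blocked due to back-off"


def parse_stern_line(line):
    """Return the JSON payload of a '<pod> <container> {json}' line, or None."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        record = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    return record if isinstance(record, dict) else None


def backoff_lines(lines, exec_id):
    """Yield (timestamp, message) for back-off messages logged for exec_id.

    Assumption: the back-off line carries "json.exec_id" like the lines above.
    """
    for line in lines:
        record = parse_stern_line(line)
        if record is None:
            continue
        context = record.get("json")
        if not isinstance(context, dict) or context.get("exec_id") != exec_id:
            continue
        msg = record.get("msg", "")
        if BACKOFF_MSG in msg:
            yield record.get("ts"), msg.strip()


# Example use on a saved log file (file name is hypothetical):
# with open("propeller.log") as fh:
#     for ts, msg in backoff_lines(fh, "akmwln5vsvgk2tc4rbj6"):
#         print(ts, msg)
```

If this yields nothing, the message may be logged without an exec_id, so a plain substring search for the message is a reasonable fallback.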
