Hey, we have a workflow where one of the task Pods...
# ask-the-community
h
Hey, we have a workflow where one of the task Pods is not starting. It started happening after we adjusted resource requests, but I can't see that the Pod is being scheduled. If no worker node matched the requested resources, we should see the Pod in Pending state, waiting for a worker node to be brought up. Tailing all logs from the flyte namespaces and grepping for the task name yields the following logs (in comment).
~ stern -i evaluate_model .
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","node":"flyteworkflowstrainandevaluateevaluatemodel","ns":"defect-detection-development","res_ver":"437166569","routine":"worker-16","src":"container_helper.go:283","tasktype":"python-task","wf":"defect-detection:development:.flytegen.flyte.workflows.train_and_evaluate.evaluate_model"},"level":"info","msg":"Adjusted container resources [{map[cpu:{{1 0} {\u003cnil\u003e} 1 DecimalSI} memory:{{10737418240 0} {\u003cnil\u003e} 10Gi BinarySI} nvidia.com/gpu:{{1 0} {\u003cnil\u003e} 1 DecimalSI}] map[cpu:{{1 0} {\u003cnil\u003e} 1 DecimalSI} memory:{{10737418240 0} {\u003cnil\u003e} 10Gi BinarySI} nvidia.com/gpu:{{1 0} {\u003cnil\u003e} 1 DecimalSI}]}]","ts":"2022-10-28T12:03:43Z"}
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","node":"flyteworkflowstrainandevaluateevaluatemodel","ns":"defect-detection-development","res_ver":"437166569","routine":"worker-16","src":"plugin_manager.go:207","tasktype":"python-task","wf":"defect-detection:development:.flytegen.flyte.workflows.train_and_evaluate.evaluate_model"},"level":"info","msg":"Creating Object: Type:[/v1, Kind=pod], Object:[defect-detection-development/akmwln5vsvgk2tc4rbj6-fs3r6kdy-0]","ts":"2022-10-28T12:03:43Z"}
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","node":"flyteworkflowstrainandevaluateevaluatemodel","ns":"defect-detection-development","res_ver":"437166569","routine":"worker-16","src":"plugin_manager.go:181","tasktype":"python-task","wf":"defect-detection:development:.flytegen.flyte.workflows.train_and_evaluate.evaluate_model"},"level":"info","msg":"The resource requirement for creating Pod [defect-detection-development/akmwln5vsvgk2tc4rbj6-fs3r6kdy-0] is [[{[cpu]: [1]} {[memory]: [10Gi]} {[nvidia.com/gpu]: [1]}]]\n","ts":"2022-10-28T12:03:43Z"}
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","node":"flyteworkflowstrainandevaluateevaluatemodel","ns":"defect-detection-development","res_ver":"437166569","routine":"worker-16","src":"controller.go:38","tasktype":"python-task","wf":"defect-detection:development:.flytegen.flyte.workflows.train_and_evaluate.evaluate_model"},"level":"info","msg":"The back-off handler for [/v1, Kind=pod,defect-detection-development] has been loaded.\n","ts":"2022-10-28T12:03:43Z"}
flytepropeller-54b4b6cd56-q2ms2 flytepropeller {"json":{"exec_id":"abrgvfvdnwh2jrgfr682","node":"n6","ns":"defect-detection-development","res_ver":"437079416","routine":"worker-23","src":"pre_post_execution.go:114","tasktype":"python-task","wf":"defect-detection:development:flyte.workflows.train_and_evaluate.wf"},"level":"info","msg":"Catalog CacheSerializeDisabled: for Task [defect-detection/development/flyte.workflows.train_and_evaluate.evaluate_model/bb7087e]","ts":"2022-10-28T12:03:43Z"}
d
@Hampus Rosvall thanks for the logs, they really help with debugging! Here is what I think is happening: the log line "The back-off handler for [/v1, Kind=pod,defect-detection-development] has been loaded" tells us that you have the back-off handler enabled in FlytePropeller. This is a construct in Flyte designed to reduce the number of API server calls. In this instance, we track the resource limits for a k8s namespace; if scheduling a Pod would exceed those limits, we back off and try again. Can you confirm this by searching for the log message "The operation was blocked due to back-off", which should appear with the same workflow execution ID (i.e. akmwln5vsvgk2tc4rbj6)?
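If grepping the raw stern output gets noisy, a small filter along these lines might help narrow it down to one execution. This is only a sketch: it assumes every line has the shape shown in the logs above (a pod and container prefix followed by a JSON object with `json.exec_id`, `msg`, and `ts` fields), and `matching_log_lines` is a made-up helper name, not part of Flyte or stern. The sample line is abridged from the logs in this thread.

```python
import json


def matching_log_lines(lines, exec_id, needle):
    """Yield (timestamp, message) for log lines that belong to one
    execution and whose message contains `needle`."""
    for line in lines:
        # stern prefixes each line with "<pod> <container> "; the JSON
        # payload (assumed format) starts at the first "{".
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line, skip it
        if not isinstance(record, dict):
            continue
        meta = record.get("json")
        if not isinstance(meta, dict) or meta.get("exec_id") != exec_id:
            continue
        msg = record.get("msg", "")
        if needle in msg:
            yield record.get("ts"), msg.strip()


# Sample line abridged from the logs posted above (some fields dropped).
SAMPLE = (
    'flytepropeller-54b4b6cd56-q2ms2 flytepropeller '
    '{"json":{"exec_id":"akmwln5vsvgk2tc4rbj6","src":"controller.go:38"},'
    '"level":"info",'
    '"msg":"The back-off handler for [/v1, Kind=pod,defect-detection-development] has been loaded.\\n",'
    '"ts":"2022-10-28T12:03:43Z"}'
)

for ts, msg in matching_log_lines([SAMPLE], "akmwln5vsvgk2tc4rbj6", "back-off"):
    print(ts, msg)
```

To run it against live logs, you could pass `sys.stdin` as `lines`, pipe the stern output into the script, and use "blocked due to back-off" as the search string.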