Len Strnad
06/16/2023, 3:53 PM
I'm giving the kftensorflow plugin a try and struggling to set it up correctly. I don’t see any issues on GitHub, so I figured I would try here to see if anyone can catch anything. I am on flytepropeller 1.6.1 and flytekit 1.7.0. Details in 🧵
@task(
    task_config=TfJob(
        worker=Worker(
            replicas=5,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        ps=PS(replicas=0),
        chief=Chief(replicas=0),
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.RUNNING),
    ),
)
This configuration fails to schedule. In the training operator I am seeing:
INFO TFJob.kubeflow.org "azg7k57nsljsq6fwkljk-n5-0-n5-n0-0" not found {"tfjob": "redwood-development/azg7k57nsljsq6fwkljk-n5-0-n5-n0-0", "unable to fetch TFJob":
and in the console I am seeing
RuntimeExecutionError: max number of system retry attempts [51/50] exhausted.
Then, I try to specify a chief in case my assumption about needing workers is incorrect:
@task(
    task_config=TfJob(
        worker=Worker(
            replicas=5,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        ps=PS(replicas=0),
        chief=Chief(
            replicas=1,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.RUNNING),
    ),
)
but I see the same issue that it can’t schedule and the training operator is sad.
Then I try to specify a parameter server as follows:
@task(
    task_config=TfJob(
        worker=Worker(
            replicas=5,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        ps=PS(
            replicas=1,
            requests=Resources(cpu="7", mem="15Gi", gpu="1"),
            limits=Resources(cpu="7", mem="15Gi", gpu="1"),
            restart_policy=RestartPolicy.NEVER,
        ),
        chief=Chief(
            replicas=1,
            requests=Resources(cpu="7", mem="15Gi", gpu="1"),
            limits=Resources(cpu="7", mem="15Gi", gpu="1"),
            restart_policy=RestartPolicy.ALWAYS,
        ),
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.RUNNING),
    ),
)
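One way to check which roles a pod actually sees for a configuration like this is to inspect TF_CONFIG from inside the task. The following is a minimal sketch, assuming the standard TF_CONFIG JSON layout with "cluster" and "task" keys. The helper name summarize_tf_config and the hostnames are hypothetical, and the sketch only reports which roles are present; it does not diagnose why a job fails or hangs.

```python
import json


def summarize_tf_config(raw: str) -> dict:
    """Return replica counts per role and this pod's own task entry."""
    # An empty or unset TF_CONFIG is treated as "no cluster information".
    config = json.loads(raw) if raw else {}
    cluster = config.get("cluster", {})
    return {
        "roles": {role: len(hosts) for role, hosts in cluster.items()},
        "task": config.get("task", {}),
        "has_ps": len(cluster.get("ps", [])) > 0,
    }


# Hypothetical TF_CONFIG for illustration only; the hostnames are made up.
example = json.dumps(
    {
        "cluster": {
            "chief": ["job-chief-0:2222"],
            "worker": ["job-worker-0:2222", "job-worker-1:2222"],
            "ps": ["job-ps-0:2222"],
        },
        "task": {"type": "worker", "index": 1},
    }
)

summary = summarize_tf_config(example)
print(summary)

# Inside a running pod, the real value would come from the environment:
#   import os
#   print(summarize_tf_config(os.environ.get("TF_CONFIG", "")))
```

Logging this summary at the start of the task on each replica would show whether a "ps" entry is present in the cluster spec that the training code receives.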
Then things schedule but hang indefinitely; my guess is that MultiWorkerMirroredStrategy is confused about why there is a PS in the TF_CONFIG.
Ketan (kumare3)
Yubo Wang
06/16/2023, 4:59 PM
Len Strnad
06/16/2023, 5:12 PM
Yubo Wang
06/16/2023, 5:21 PM
kubectl describe tfjob
with
@task(
    task_config=TfJob(
        worker=Worker(
            replicas=5,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        ps=PS(replicas=0),
        chief=Chief(replicas=0),
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.RUNNING),
    ),
)
Len Strnad
06/16/2023, 5:27 PM
describe in one sec, have to fast register and wait for nodes
describe?
Yubo Wang
06/16/2023, 5:30 PM
Len Strnad
06/16/2023, 5:37 PM
Workflow[redwood:development:flyte.redwood.tensorflow.train_eval_workflow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n5-n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [tensorflow]: panic when executing a plugin [tensorflow]. Stack: [goroutine 1063 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1.1()
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:389 +0xfe
panic({0x2291e00, 0x3fb42a0})
/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/flyteorg/flyteplugins/go/tasks/plugins/k8s/kfoperators/tensorflow.tensorflowOperatorResourceHandler.GetTaskPhase({}, {0x0?, 0xc0097e7f68?}, {0x7fd9e339c108?, 0xc0049c2c90?}, {0x2bd3228?, 0xc008799d40})
/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.67/go/tasks/plugins/k8s/kfoperators/tensorflow/tensorflow.go:203 +0xc8
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/k8s.(*PluginManager).CheckResourcePhase(0xc00e41e518, {0x2bbeb18, 0xc008797f80}, {0x2bcaf00, 0xc008ea80c0}, 0xc011b2c248)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go:283 +0xc83
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/k8s.PluginManager.Handle({{0x268d1e9, 0xa}, {0x2bbfbb8, 0x401b590}, {0x2ba3e00, 0xc0006d3860}, {0x7fd9bc1f5780, 0xc000d6a4e0}, {{0x2bcf670, 0xc000dd5c90}, ...}, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go:338 +0x685
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1(0x19?, {0x2bbeb18, 0xc008797d10}, {0x2bc07d8?, 0xc001e54140?}, 0x3f658e8?)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:396 +0x184
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin({{0x2bc0798, 0xc0013fab10}, {0x2baac78, 0xc00152d180}, 0xc001549f80, 0xc001549fb0, 0xc00163a000, {0x2bc07d8, 0xc0021a6140}, 0xc000621080, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:398 +0x9a
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.Handle({{0x2bc0798, 0xc0013fab10}, {0x2baac78, 0xc00152d180}, 0xc001549f80, 0xc001549fb0, 0xc00163a000, {0x2bc07d8, 0xc0021a6140}, 0xc000621080, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:666 +0x1ba5
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.handleParentNode({{0x2bc2210, 0xc00138bd40}, {{0xc001638310, {{...}, 0x0}, {0xc0005aa5c0, 0x4, 0x4}}, {0xc001638330, {{...}, ...}, ...}, ...}, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:70 +0xd8
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.Handle({{0x2bc2210, 0xc00138bd40}, {{0xc001638310, {{...}, 0x0}, {0xc0005aa5c0, 0x4, 0x4}}, {0xc001638330, {{...}, ...}, ...}, ...}, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:224 +0x9d0
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc00138b980, {0x2bbeb18, 0xc008797890}, {0x2bc05d8, 0xc0013cf180}, 0xc008ea8000, {0x2bd4d70?, 0xc008f03e10?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:460 +0x157
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc00138b980, {0x2bbeb18, 0xc008797890}, 0xc008ea8000, {0x2bc05d8?, 0xc0013cf180?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:593 +0x227
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc00138b980, {0x2bbeb18, 0xc008797890}, {0x2baade0, 0xc008797860}, 0xc008ea8000, {0x2bc05d8?, 0xc0013cf180})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:820 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc00138b980, {0x2bbeb18, 0xc008797350}, {0x2bcf238, 0xc005e3d4f0}, {0x2baade0, 0xc008797860}, {0x2bbee28?, 0xc0027a5400?}, {0x2bccb50, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1018 +0x705
github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch.(*branchHandler).recurseDownstream(0xc00163a840, {0x2bbeb18, 0xc008797350}, {0x2bccc90, 0xc008dfda40}, {0x2bd4d70, 0xc008f03ba0}, {0x2bccb50?, 0xc00c17c900?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch/handler.go:148 +0x409
github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch.(*branchHandler).HandleBranchNode(0xc005a5a4b0?, {0x2bbeb18, 0xc008797350}, {0x2bbedf0, 0xc008f51e00}, {0x2bccc90?, 0xc008dfda40?}, {0x2bbee28, 0xc0027a5400})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch/handler.go:103 +0x949
github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch.(*branchHandler).Handle(0x269326a?, {0x2bbeb18, 0xc008797350}, {0x2bccc90?, 0xc008dfda40?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch/handler.go:115 +0x137
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc00138b980, {0x2bbeb18, 0xc008797350}, {0x2bc0598, 0xc00163a840}, 0xc008dfda40, {0x2bd4d70?, 0xc008f03ba0?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:460 +0x157
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc00138b980, {0x2bbeb18, 0xc008797350}, 0xc008dfda40, {0x2bc0598?, 0xc00163a840?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:593 +0x227
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc00138b980, {0x2bbeb18, 0xc008797350}, {0x2ba3040, 0xc0027a5400}, 0xc008dfda40, {0x2bc0598?, 0xc00163a840})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:820 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc00138b980, {0x2bbeb18, 0xc00bd6a750}, {0x2bcf238, 0xc008545630}, {0x2ba3040, 0xc0027a5400}, {0x2bbee28?, 0xc0027a5400?}, {0x2bccb50, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1018 +0x705
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleDownstream(0x26d14bd?, {0x2bbeb18, 0xc00bd6a750}, {0x2bcf238, 0xc008545630}, {0x2ba3040, 0xc0027a5400?}, {0x2bbee28?, 0xc0027a5400}, {0x2bccb50, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:858 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc00138b980, {0x2bbeb18, 0xc00bd6a750}, {0x2bcf238, 0xc008545630}, {0x2ba3040, 0xc0027a5400}, {0x2bbee28?, 0xc0027a5400?}, {0x2bccb50, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1025 +0x935
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).handleRunningWorkflow(0xc0004762a0, {0x2bbeb18, 0xc00bd6a750}, 0xc0027a5400)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:147 +0x1b3
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).HandleFlyteWorkflow(0xc0004762a0, {0x2bbeb18, 0xc00bd6a750}, 0xc0027a5400)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:393 +0x40f
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow.func2(0xc00167e300, {0x2bbeb18, 0xc00bd6a750}, 0xc00e4277d0, 0x214b2a0?)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:142 +0x18e
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow(0xc00167e300, {0x2bbeb18, 0xc00bd6a2a0}, 0xc0027a4a00)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:143 +0x495
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).Handle(0xc00167e300, {0x2bbeb18, 0xc00bd6a2a0}, {0xc003165710, 0x13}, {0xc003165724, 0x14})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:259 +0xe4a
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem.func1(0xc000b39170, 0xc00e427f28, {0x214b2a0?, 0xc0091943a0})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:88 +0x510
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem(0xc000b39170, {0x2bbeb18, 0xc00bd6a2a0})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:99 +0xf1
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).runWorker(0x2bbeb18?, {0x2bbeb18, 0xc002b32120})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:115 +0xbd
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run.func1()
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:150 +0x59
created by github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:147 +0x285
]
Yubo Wang
06/16/2023, 5:43 PM
Len Strnad
06/16/2023, 5:43 PM
Yubo Wang
06/16/2023, 5:44 PM
Len Strnad
06/16/2023, 5:45 PM
Yubo Wang
06/16/2023, 5:46 PM
Len Strnad
06/16/2023, 5:47 PM
Ketan (kumare3)
Len Strnad
06/20/2023, 3:16 PM
v1.1.96. Do we think a possible fix has been introduced since? I’ll try to upgrade to v1.1.98 in any case.
Yubo Wang
06/20/2023, 5:14 PM
Len Strnad
06/20/2023, 5:15 PM
1.1.96 should be good?
1.1.98 today.
Yubo Wang
06/20/2023, 5:17 PM
Len Strnad
06/20/2023, 5:19 PM
interruptible=True for your task?
Yubo Wang
06/20/2023, 5:19 PM
Len Strnad
06/20/2023, 5:20 PM