I am giving the `kftensorflow` plugin a try and st...
# ask-the-community
l
I am giving the `kftensorflow` plugin a try and struggling to set it up correctly. I don’t see any related issues on GitHub, so I figured I would try here to see if anyone can catch anything. I am on flytepropeller 1.6.1 and flytekit 1.7.0. Details in 🧵
I am using TensorFlow's MultiWorkerMirroredStrategy, which, as far as I understand, only requires workers, one of which is randomly chosen to be the chief. This is what I am seeing:
@task(
    task_config=TfJob(
        worker=Worker(
            replicas=5,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        ps=PS(replicas=0),
        chief=Chief(replicas=0),
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.RUNNING),
    ),
)
This fails to schedule. In the training operator logs I am seeing:
INFO    TFJob.kubeflow.org "azg7k57nsljsq6fwkljk-n5-0-n5-n0-0" not found    {"tfjob": "redwood-development/azg7k57nsljsq6fwkljk-n5-0-n5-n0-0", "unable to fetch TFJob":
and in the console I am seeing
RuntimeExecutionError: max number of system retry attempts [51/50] exhausted.
Then I try to specify a chief, in case my assumption about needing only workers is incorrect:
@task(
    task_config=TfJob(
        worker=Worker(
            replicas=5,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        ps=PS(replicas=0),
        chief=Chief(
            replicas=1,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.RUNNING),
    ),
)
but I see the same issue: it can’t schedule and the training operator is sad. Then I try to specify a parameter server as follows:
@task(
    task_config=TfJob(
        worker=Worker(
            replicas=5,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        ps=PS(
            replicas=1,
            requests=Resources(cpu="7", mem="15Gi", gpu="1"),
            limits=Resources(cpu="7", mem="15Gi", gpu="1"),
            restart_policy=RestartPolicy.NEVER,
        ),
        chief=Chief(
            replicas=1,
            requests=Resources(cpu="7", mem="15Gi", gpu="1"),
            limits=Resources(cpu="7", mem="15Gi", gpu="1"),
            restart_policy=RestartPolicy.ALWAYS,
        ),
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.RUNNING),
    ),
)
Then things schedule but hang indefinitely. My guess is that MultiWorkerMirroredStrategy is confused by the PS entry in TF_CONFIG.
cc: @Yubo Wang (I think you added this functionality) 🙏
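One way to check the guess about the PS entry is to inspect TF_CONFIG from inside a running replica. TF_CONFIG is a JSON environment variable with a `cluster` mapping (role name to a list of host:port addresses) and a `task` entry describing the current replica. The sketch below is a minimal, hypothetical helper (`unexpected_roles` is not part of flytekit or TensorFlow) that lists any populated cluster roles other than `chief` and `worker`. Treating those two as the only expected roles for MultiWorkerMirroredStrategy is an assumption of this sketch.

```python
import json
import os
from typing import List

# Assumption for this sketch: a MultiWorkerMirroredStrategy cluster is
# expected to contain only these roles.
EXPECTED_ROLES = {"chief", "worker"}


def unexpected_roles(tf_config_json: str) -> List[str]:
    """Return populated cluster roles in TF_CONFIG other than chief/worker."""
    config = json.loads(tf_config_json)
    cluster = config.get("cluster", {})
    # Ignore roles with an empty host list; they contribute no replicas.
    return sorted(
        role for role, hosts in cluster.items()
        if hosts and role not in EXPECTED_ROLES
    )


if __name__ == "__main__":
    raw = os.environ.get("TF_CONFIG")
    if raw is None:
        print("TF_CONFIG is not set")
    else:
        print("task:", json.loads(raw).get("task"))
        print("unexpected roles:", unexpected_roles(raw))
```

Calling this at the top of the task body and logging the result would show whether a `ps` role is present in the replica's environment.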
k
Yes, also cc @Fabio Grätz, who uses this a lot
y
Let me reproduce your issue @Len Strnad
l
Our training-operator is v1.5.0 with the most basic deployment since we are testing this out for now
y
Interesting. @Len Strnad, can you run `kubectl describe tfjob` with
@task(
    task_config=TfJob(
        worker=Worker(
            replicas=5,
            requests=Resources(cpu="15", mem="30Gi", gpu="2"),
            limits=Resources(cpu="15", mem="30Gi", gpu="2"),
            restart_policy=RestartPolicy.FAILURE,
        ),
        ps=PS(replicas=0),
        chief=Chief(replicas=0),
        run_policy=RunPolicy(clean_pod_policy=CleanPodPolicy.RUNNING),
    ),
)
Ah, I think the versions are mismatched. Can you try the propeller from the latest release?
l
Will `describe` in one sec, have to fast register and wait for nodes
Is there a point in time you want me to run `describe`?
y
just after the job launched
after your tfjob is created
l
Everything is dated for yesterday.
The workflow error is
Workflow[redwood:development:flyte.redwood.tensorflow.train_eval_workflow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n5-n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [tensorflow]: panic when executing a plugin [tensorflow]. Stack: [goroutine 1063 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1.1()
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:389 +0xfe
panic({0x2291e00, 0x3fb42a0})
	/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/flyteorg/flyteplugins/go/tasks/plugins/k8s/kfoperators/tensorflow.tensorflowOperatorResourceHandler.GetTaskPhase({}, {0x0?, 0xc0097e7f68?}, {0x7fd9e339c108?, 0xc0049c2c90?}, {0x2bd3228?, 0xc008799d40})
	/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.67/go/tasks/plugins/k8s/kfoperators/tensorflow/tensorflow.go:203 +0xc8
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/k8s.(*PluginManager).CheckResourcePhase(0xc00e41e518, {0x2bbeb18, 0xc008797f80}, {0x2bcaf00, 0xc008ea80c0}, 0xc011b2c248)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go:283 +0xc83
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/k8s.PluginManager.Handle({{0x268d1e9, 0xa}, {0x2bbfbb8, 0x401b590}, {0x2ba3e00, 0xc0006d3860}, {0x7fd9bc1f5780, 0xc000d6a4e0}, {{0x2bcf670, 0xc000dd5c90}, ...}, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go:338 +0x685
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1(0x19?, {0x2bbeb18, 0xc008797d10}, {0x2bc07d8?, 0xc001e54140?}, 0x3f658e8?)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:396 +0x184
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin({{0x2bc0798, 0xc0013fab10}, {0x2baac78, 0xc00152d180}, 0xc001549f80, 0xc001549fb0, 0xc00163a000, {0x2bc07d8, 0xc0021a6140}, 0xc000621080, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:398 +0x9a
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.Handle({{0x2bc0798, 0xc0013fab10}, {0x2baac78, 0xc00152d180}, 0xc001549f80, 0xc001549fb0, 0xc00163a000, {0x2bc07d8, 0xc0021a6140}, 0xc000621080, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:666 +0x1ba5
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.handleParentNode({{0x2bc2210, 0xc00138bd40}, {{0xc001638310, {{...}, 0x0}, {0xc0005aa5c0, 0x4, 0x4}}, {0xc001638330, {{...}, ...}, ...}, ...}, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:70 +0xd8
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.Handle({{0x2bc2210, 0xc00138bd40}, {{0xc001638310, {{...}, 0x0}, {0xc0005aa5c0, 0x4, 0x4}}, {0xc001638330, {{...}, ...}, ...}, ...}, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:224 +0x9d0
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc00138b980, {0x2bbeb18, 0xc008797890}, {0x2bc05d8, 0xc0013cf180}, 0xc008ea8000, {0x2bd4d70?, 0xc008f03e10?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:460 +0x157
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc00138b980, {0x2bbeb18, 0xc008797890}, 0xc008ea8000, {0x2bc05d8?, 0xc0013cf180?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:593 +0x227
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc00138b980, {0x2bbeb18, 0xc008797890}, {0x2baade0, 0xc008797860}, 0xc008ea8000, {0x2bc05d8?, 0xc0013cf180})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:820 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc00138b980, {0x2bbeb18, 0xc008797350}, {0x2bcf238, 0xc005e3d4f0}, {0x2baade0, 0xc008797860}, {0x2bbee28?, 0xc0027a5400?}, {0x2bccb50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1018 +0x705
github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch.(*branchHandler).recurseDownstream(0xc00163a840, {0x2bbeb18, 0xc008797350}, {0x2bccc90, 0xc008dfda40}, {0x2bd4d70, 0xc008f03ba0}, {0x2bccb50?, 0xc00c17c900?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch/handler.go:148 +0x409
github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch.(*branchHandler).HandleBranchNode(0xc005a5a4b0?, {0x2bbeb18, 0xc008797350}, {0x2bbedf0, 0xc008f51e00}, {0x2bccc90?, 0xc008dfda40?}, {0x2bbee28, 0xc0027a5400})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch/handler.go:103 +0x949
github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch.(*branchHandler).Handle(0x269326a?, {0x2bbeb18, 0xc008797350}, {0x2bccc90?, 0xc008dfda40?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/branch/handler.go:115 +0x137
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc00138b980, {0x2bbeb18, 0xc008797350}, {0x2bc0598, 0xc00163a840}, 0xc008dfda40, {0x2bd4d70?, 0xc008f03ba0?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:460 +0x157
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc00138b980, {0x2bbeb18, 0xc008797350}, 0xc008dfda40, {0x2bc0598?, 0xc00163a840?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:593 +0x227
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc00138b980, {0x2bbeb18, 0xc008797350}, {0x2ba3040, 0xc0027a5400}, 0xc008dfda40, {0x2bc0598?, 0xc00163a840})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:820 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc00138b980, {0x2bbeb18, 0xc00bd6a750}, {0x2bcf238, 0xc008545630}, {0x2ba3040, 0xc0027a5400}, {0x2bbee28?, 0xc0027a5400?}, {0x2bccb50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1018 +0x705
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleDownstream(0x26d14bd?, {0x2bbeb18, 0xc00bd6a750}, {0x2bcf238, 0xc008545630}, {0x2ba3040, 0xc0027a5400?}, {0x2bbee28?, 0xc0027a5400}, {0x2bccb50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:858 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc00138b980, {0x2bbeb18, 0xc00bd6a750}, {0x2bcf238, 0xc008545630}, {0x2ba3040, 0xc0027a5400}, {0x2bbee28?, 0xc0027a5400?}, {0x2bccb50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1025 +0x935
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).handleRunningWorkflow(0xc0004762a0, {0x2bbeb18, 0xc00bd6a750}, 0xc0027a5400)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:147 +0x1b3
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).HandleFlyteWorkflow(0xc0004762a0, {0x2bbeb18, 0xc00bd6a750}, 0xc0027a5400)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:393 +0x40f
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow.func2(0xc00167e300, {0x2bbeb18, 0xc00bd6a750}, 0xc00e4277d0, 0x214b2a0?)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:142 +0x18e
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow(0xc00167e300, {0x2bbeb18, 0xc00bd6a2a0}, 0xc0027a4a00)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:143 +0x495
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).Handle(0xc00167e300, {0x2bbeb18, 0xc00bd6a2a0}, {0xc003165710, 0x13}, {0xc003165724, 0x14})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:259 +0xe4a
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem.func1(0xc000b39170, 0xc00e427f28, {0x214b2a0?, 0xc0091943a0})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:88 +0x510
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem(0xc000b39170, {0x2bbeb18, 0xc00bd6a2a0})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:99 +0xf1
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).runWorker(0x2bbeb18?, {0x2bbeb18, 0xc002b32120})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:115 +0xbd
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run.func1()
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:150 +0x59
created by github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:147 +0x285
]
y
interesting, so the tfjob is never created
l
I think things are failing when trying to create the TFJob CRD for this job, so it never registers at all. Just a guess
yeah
y
can you try the latest release of propeller?
l
Yeah, that is what I will try next, but I won’t be able to until next week and might need help from @Bernhard Stadlbauer (my colleague).
I’ll try that and post back here in a few days or sooner if possible!
y
I will improve the doc on matching versions of flytekit and flytepropeller later today
l
Sweet. Thanks a ton for the help. I’m very excited to get this up and running.
Not to derail, but are there any plans to support an update to tf cluster replicas? It would be nice to let the task args define the number of workers. I suppose one can always modify the deployment with python kubernetes in the task body.
k
Aha, for exactly this, @Fabio Grätz / @Bernhard Stadlbauer and @Byron Hsu are working on config overrides
l
We are on flytepropeller version `v1.1.96`. Do we think a possible fix has been introduced since? I’ll try to upgrade to `v1.1.98` in any case.
y
that propeller version should be good, do you have more logs you can provide?
l
`1.1.96` should be good?
Possibly, I’ll see what comes up when I try `1.1.98` today.
y
My versions are flytekit 1.7.0 and propeller 1.1.96, and it is working for me. I can investigate further if you can share more logs.
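Since the thread hinges on matching client and backend versions, one quick way to confirm the client side is to print the installed package versions inside the task image. This is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, and the distribution name `flytekitplugins-kftensorflow` is assumed to be the plugin's package name. The flytepropeller version is not visible from Python and has to be read from the cluster deployment instead.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names below are assumptions for this sketch.
    wanted = ["flytekit", "flytekitplugins-kftensorflow"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in the same container image that the task uses avoids comparing against a local environment that may differ from what actually executes on the cluster.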
l
Are you using `interruptible=True` for your task?
y
no
I can test with that too
l
We just added a non-interruptible node pool, I’ll try with interruptible=False just in case at some point today.