#ask-the-community

Anooj Patel

05/16/2023, 9:38 PM
Hey y'all! I'm running into the same stack trace outlined in this issue, which seemed to have been addressed. I'm on a recent version of Flyte (propeller, data catalog, admin 1.1.78), but I'm trying to use a map task for 30k inputs with concurrency=10. Am I stressing k8s-array beyond its limits?

Yee

05/17/2023, 1:25 AM
can you post the specific stack trace you’re seeing? even if it looks almost exactly the same as the one in the issue, unless the versions are exactly the same, things like line numbers will have changed, etc. it’s helpful to see exactly what’s going on.

Anooj Patel

05/17/2023, 3:24 PM
Sure!
Workflow[apfelstrudel:development:apfelstrudel.flyte.workflows.search_workflow_fragmentlevel.hpo_deeplearning_cpu] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]: panic when executing a plugin [k8s-array]. Stack: [goroutine 232 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1.1()
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:390 +0xfe
panic({0x2276320, 0x3f6e0d0})
	/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/flyteorg/flytestdlib/bitarray.(*BitSet).IsSet(...)
	/go/pkg/mod/github.com/flyteorg/flytestdlib@v1.0.15/bitarray/bitset.go:33
github.com/flyteorg/flyteplugins/go/tasks/plugins/array/core.InitializeExternalResources({0x2b96d18, 0xc01dd23a70}, {0x2ba3080?, 0xc00b0aa2c0?}, 0xc01652ab40, 0x279b758)
	/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.45/go/tasks/plugins/array/core/metadata.go:33 +0x1e1
github.com/flyteorg/flyteplugins/go/tasks/plugins/array/k8s.Executor.Handle({{0x7f74cd1c7128, 0xc0007ae680}, {{0x2b89030, 0xc004753760}}, {{0x2b89030, 0xc004753810}}}, {0x2b96d18, 0xc01dd23a70}, {0x2ba3080, 0xc00b0aa2c0})
	/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.45/go/tasks/plugins/array/k8s/executor.go:96 +0x268
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1(0x1?, {0x2b96d18, 0xc01dd236e0}, {0x2b99458?, 0xc0048ee810?}, 0x0?)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:397 +0x178
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin({{0x2b98958, 0xc0014c90f8}, {0x2b83398, 0xc0014baa60}, 0xc001683920, 0xc001683950, 0xc001683980, {0x2b98998, 0xc020838640}, 0xc0003a4000, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:399 +0x9a
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.Handle({{0x2b98958, 0xc0014c90f8}, {0x2b83398, 0xc0014baa60}, 0xc001683920, 0xc001683950, 0xc001683980, {0x2b98998, 0xc020838640}, 0xc0003a4000, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:672 +0x1de5
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.handleParentNode({{0x2b9a388, 0xc0004000d0}, {{0xc0005b29c0, {{...}, 0x0}, {0xc000fa4000, 0x4, 0x4}}, {0xc0005b29e0, {{...}, ...}, ...}, ...}, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:70 +0xd8
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.Handle({{0x2b9a388, 0xc0004000d0}, {{0xc0005b29c0, {{...}, 0x0}, {0xc000fa4000, 0x4, 0x4}}, {0xc0005b29e0, {{...}, ...}, ...}, ...}, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:224 +0x9d0
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc0003e20c0, {0x2b96d18, 0xc01dd23050}, {0x2b98798, 0xc000a38280}, 0xc0318a0000, {0x2baced0?, 0xc004faa4e0?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:460 +0x157
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc0003e20c0, {0x2b96d18, 0xc01dd23050}, 0xc0318a0000, {0x2b98798?, 0xc000a38280?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:593 +0x227
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc0003e20c0, {0x2b96d18, 0xc01dd23050}, {0x2b7b760, 0xc00a42cf00}, 0xc0318a0000, {0x2b98798?, 0xc000a38280})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:820 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc0003e20c0, {0x2b96d18, 0xc01dd22a20}, {0x2ba7398, 0xc043119770}, {0x2b7b760, 0xc00a42cf00}, {0x2b96ff0?, 0xc00a42cf00?}, {0x2ba4d50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1018 +0x705
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleDownstream(0x26ae306?, {0x2b96d18, 0xc01dd22a20}, {0x2ba7398, 0xc043119770}, {0x2b7b760, 0xc00a42cf00?}, {0x2b96ff0?, 0xc00a42cf00}, {0x2ba4d50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:858 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc0003e20c0, {0x2b96d18, 0xc01dd22a20}, {0x2ba7398, 0xc043119770}, {0x2b7b760, 0xc00a42cf00}, {0x2b96ff0?, 0xc00a42cf00?}, {0x2ba4d50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1025 +0x935
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleDownstream(0x26ae306?, {0x2b96d18, 0xc01dd22a20}, {0x2ba7398, 0xc043119770}, {0x2b7b760, 0xc00a42cf00?}, {0x2b96ff0?, 0xc00a42cf00}, {0x2ba4d50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:858 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc0003e20c0, {0x2b96d18, 0xc01dd22a20}, {0x2ba7398, 0xc043119770}, {0x2b7b760, 0xc00a42cf00}, {0x2b96ff0?, 0xc00a42cf00?}, {0x2ba4d50, ...})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:1025 +0x935
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).handleRunningWorkflow(0xc0008a9dc0, {0x2b96d18, 0xc01dd22a20}, 0xc00a42cf00)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:147 +0x1b3
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).HandleFlyteWorkflow(0xc0008a9dc0, {0x2b96d18, 0xc01dd22a20}, 0xc00a42cf00)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:393 +0x40f
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow.func2(0xc000b93300, {0x2b96d18, 0xc01dd22a20}, 0xc00567b7d0, 0x2130000?)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:142 +0x18e
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow(0xc000b93300, {0x2b96d18, 0xc01dd225d0}, 0xc013aa6500)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:143 +0x495
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).Handle(0xc000b93300, {0x2b96d18, 0xc01dd225d0}, {0xc01764c9c0, 0x18}, {0xc01764c9d9, 0x14})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:259 +0xe4a
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem.func1(0xc00032b9e0, 0xc00567bf28, {0x2130000?, 0xc045a12cc0})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:88 +0x510
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem(0xc00032b9e0, {0x2b96d18, 0xc01dd225d0})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:99 +0xf1
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).runWorker(0x2b96d18?, {0x2b96d18, 0xc019212030})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:115 +0xbd
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run.func1()
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:150 +0x59
created by github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:147 +0x285
]
That is the error at the top of the workflow panel. The map task side panel shows:
max number of system retry attempts [51/50] exhausted - system failure.
For versions:
> kubectl describe po | grep Image:
    Image:         gcr.io/cloudsql-docker/gce-proxy:1.14
    Image:         gcr.io/cloudsql-docker/gce-proxy:1.14
    Image:         gcr.io/cloudsql-docker/gce-proxy:1.14
    Image:         cr.flyte.org/flyteorg/datacatalog-release:v1.4.3
    Image:         cr.flyte.org/flyteorg/datacatalog-release:v1.4.3
    Image:         cr.flyte.org/flyteorg/flyteadmin-release:v1.4.3
    Image:         cr.flyte.org/flyteorg/flyteadmin-release:v1.4.3
    Image:         cr.flyte.org/flyteorg/flyteadmin-release:v1.4.3
    Image:         cr.flyte.org/flyteorg/flyteadmin-release:v1.4.3
    Image:         cr.flyte.org/flyteorg/flyteconsole-release:v1.4.3
    Image:         cr.flyte.org/flyteorg/flytepropeller:v1.1.78

Dan Rammer (hamersaw)

05/17/2023, 7:54 PM
Hey @Anooj Patel this certainly seems like a bug. It's very similar to the one you linked, but should be an easier fix. In the logs, do you see anything like "array size > max allowed. requested [%v]. allowed [%v]"? My guess is that this lookup phase is returning an error which is not caught here. I'll submit a bug fix soon, but I want to see if we can work around it.

Yee

05/17/2023, 7:54 PM
i see…
give us some time to investigate
give dan rather some time to investigate 🙂

Anooj Patel

05/17/2023, 8:08 PM
@Dan Rammer (hamersaw) @Yee thanks for the heads up! I'll relaunch and rummage through the logs for it...

Dan Rammer (hamersaw)

05/17/2023, 8:35 PM
This maximum array size is also configured with the maxArrayJobSize value, at something like plugins.k8s-array.maxArrayJobSize, and then other specific configuration if caching is enabled (is it?). The default value is something like 5000, but I'm sure you've increased it.
Just to confirm, I was able to reproduce this locally with a job that exceeds the maxArrayJobSize parameter.
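Until the limit is raised, one possible workaround is to split the inputs into batches no larger than the configured limit and run a separate map task per batch. Here is a minimal sketch in plain Python, assuming the limit is 5000 (the rough default mentioned above). `MAX_ARRAY_JOB_SIZE` and `chunk_inputs` are hypothetical names for illustration, not part of flytekit or Flyte's configuration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Assumed limit; set this to whatever plugins.k8s-array.maxArrayJobSize
# is in your deployment.
MAX_ARRAY_JOB_SIZE = 5000


def chunk_inputs(inputs: Sequence[T], max_size: int = MAX_ARRAY_JOB_SIZE) -> Iterator[List[T]]:
    """Yield consecutive batches of at most max_size items, preserving order."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    for start in range(0, len(inputs), max_size):
        yield list(inputs[start:start + max_size])


# 30k inputs become six batches of 5000, each small enough for one map task.
batches = list(chunk_inputs(list(range(30_000))))
print(len(batches), [len(b) for b in batches])
```

Each batch could then be passed to its own map task, so no single array job exceeds the limit.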

Anooj Patel

05/17/2023, 10:46 PM
I'm a newb with parsing logs with k9s and gcp, lemme try again. I'll reach out to our infra team and try to up this value. I'll report back here -- thanks @Dan Rammer (hamersaw)!
in terms of flyte propeller <> datacatalog robustness, would we expect strange behavior if we up the limit to 30k?

Dan Rammer (hamersaw)

05/18/2023, 8:07 PM
There should be nothing to worry about with the scale. In map tasks, basically what happens is that it begins a sequential cache lookup process in the background, then periodically checks if it has completed. None of the subtasks will be started until the cache lookup for all of them is complete. So it may take a little longer to start truly executing the subtasks, but there shouldn't be any correctness or robustness issues.
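The ordering described above (look up every subtask in the cache first, then start only what is still needed) can be illustrated with a small sketch. This is a simplified, synchronous model in Python, not Flyte's actual implementation, which is written in Go and runs the lookup in the background. `plan_subtasks` and `fake_cache` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional, Tuple


def plan_subtasks(
    inputs: List[str],
    cache_lookup: Callable[[str], Optional[str]],
) -> Tuple[Dict[int, str], List[int]]:
    """Check the cache for every input, in order, before anything runs.

    Returns the cached outputs keyed by subtask index, plus the list of
    indices that missed the cache and still need a real subtask.
    """
    cached: Dict[int, str] = {}
    to_run: List[int] = []
    for index, item in enumerate(inputs):
        hit = cache_lookup(item)
        if hit is not None:
            cached[index] = hit
        else:
            to_run.append(index)
    return cached, to_run


# Stand-in for a cache service: two of the four inputs are already cached.
fake_cache = {"a": "out-a", "c": "out-c"}
cached, to_run = plan_subtasks(["a", "b", "c", "d"], fake_cache.get)
print(cached)  # {0: 'out-a', 2: 'out-c'}
print(to_run)  # [1, 3]
```

Because the lookup is sequential, its cost in this model grows linearly with the number of inputs, which matches the point that a 30k-input map task may take longer to begin executing subtasks.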

Anooj Patel

05/18/2023, 8:25 PM
amazing! Thanks Dan!

Dan Rammer (hamersaw)

05/22/2023, 8:06 PM
fix for the panic is submitted - https://github.com/flyteorg/flyteplugins/pull/352