
Mike Zhong

08/02/2022, 12:57 PM
Hello flyte team, we are encountering the following error when running one of our larger map-tasks. We did not encounter this error on previous map-tasks but we are seeing this reproducibly (last 2 runs) now. Any thoughts on the cause?
eventually it hits the system retry limit of 50 and crashes

Samhita Alla

08/02/2022, 1:06 PM
cc: @Dan Rammer (hamersaw)
👀 1

Dan Rammer (hamersaw)

08/02/2022, 1:07 PM
@Mike Zhong it sounds like this is only happening on a larger map task and no others? Did you recently upgrade any components?
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:487
this line has to do with checking for the existence of the newer Flyte Deck stuff, but it is peculiar that this would only fail in a single instance.

Mike Zhong

08/02/2022, 1:16 PM
we have not recently updated any components. This particular map task, in our test setting, fanned out 500 tasks
but we have other map tasks in our “pipeline” which fan out to a greater or similar degree
those did not encounter this error
we are in the process of adding additional logging, and enabling cache (most of our other tasks are cache enabled)

Dan Rammer (hamersaw)

08/02/2022, 1:20 PM
Sure, what version of FlytePropeller do you have running?

Mike Zhong

08/02/2022, 1:21 PM
looks like v1.1.0
✔️ 1

Dan Rammer (hamersaw)

08/02/2022, 3:10 PM
cc @Kevin Su looks like this is the line that panics, thoughts?
it seems like there may be a missing nil check in there somewhere.

Kevin Su

08/04/2022, 11:05 AM
@Mike Zhong Do you have any other logs (such as the log of the map task)? I tried running a map task that fanned out to 1000 tasks, but didn’t get any error.
Update: I tried running a larger map task that fanned out to 10000 tasks and got the error below. I used the same example as above but changed the input to 10000. It panics at this line. cc @Dan Rammer (hamersaw)
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1.1()
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:375 +0xfe
panic({0x1f45600, 0x3959500})
	/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/flyteorg/flytestdlib/bitarray.(*BitSet).IsSet(...)
	/go/pkg/mod/github.com/flyteorg/flytestdlib@v1.0.4/bitarray/bitset.go:33
github.com/flyteorg/flyteplugins/go/tasks/plugins/array/core.InitializeExternalResources({0x279ca70, 0xc006688a50}, {0x27a8360?, 0xc00154d760?}, 0xc005fd3440, 0x23d7cd8)
	/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.5/go/tasks/plugins/array/core/metadata.go:33 +0x1e1
github.com/flyteorg/flyteplugins/go/tasks/plugins/array/k8s.Executor.Handle({{0x7f33215a0ff0, 0xc000a7e380}, {{0x278fa10, 0xc00174a0b0}}, {{0x278fa10, 0xc00174a160}}}, {0x279ca70, 0xc006688a50}, {0x27a8360, 0xc00154d760})
	/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.5/go/tasks/plugins/array/k8s/executor.go:94 +0x225
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1(0x0?, {0x279ca70, 0xc006688810}, {0x279ef58?, 0xc000bfe240?}, 0x0?)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:382 +0x178
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin({{0x279cdb8, 0xc000a5d8d8}, {0x278a5a8, 0xc0009b4aa0}, 0xc0009c7260, 0xc0009c7290, 0xc0009c72c0, {0x279e4d8, 0xc0007f8000}, 0xc0009fc000, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:384 +0x9a
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.Handle({{0x279cdb8, 0xc000a5d8d8}, {0x278a5a8, 0xc0009b4aa0}, 0xc0009c7260, 0xc0009c7290, 0xc0009c72c0, {0x279e4d8, 0xc0007f8000}, 0xc0009fc000, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:616 +0x182b
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.handleParentNode({{0x279fe08, 0xc0009f00d0}, {{0xc000c86160, {{...}, 0x0}, {0xc000484080, 0x4, 0x4}}, {0xc000c86180, {{...}, ...}, ...}, ...}, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:70 +0xd8
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.Handle({{0x279fe08, 0xc0009f00d0}, {{0xc000c86160, {{...}, 0x0}, {0xc000484080, 0x4, 0x4}}, {0xc000c86180, {{...}, ...}, ...}, ...}, ...}, ...)
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:220 +0x9d0
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc0009e0000, {0x279ca70, 0xc006688330}, {0x279e358, 0xc000734000}, 0xc00808dec0, {0x27b0d58?, 0xc008983110?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:382 +0x157
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc0009e0000, {0x279ca70, 0xc006688330}, 0xc00808dec0, {0x279e358?, 0xc000734000?})
	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:512 +0x227
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc0009e0000, {0x279ca70, 0xc006688330}, {0x2782f10, 0xc005f9cf00}, 0xc00808dec0, {0x279e358?, 0xc000734000})
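For readers following the trace: the top frames (a closure inside invokePlugin calling into runtime/debug/stack.go, directly above panic) look like a deferred function that recovers a plugin panic and records the stack. The Go sketch below illustrates that general pattern only. The names bitSet, isSet, and safeInvoke are hypothetical and are not the flytestdlib or FlytePropeller code, and the out-of-range position is just one way such a lookup could panic; the trace itself does not say why IsSet failed.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// bitSet is a hypothetical packed bit array used only for illustration.
// It is NOT the flytestdlib bitarray.BitSet type.
type bitSet []uint64

// isSet indexes the backing slice without a bounds check, so a position
// past the end of the slice triggers an index-out-of-range runtime panic.
func (b bitSet) isSet(pos uint) bool {
	return b[pos/64]&(1<<(pos%64)) != 0
}

// safeInvoke (hypothetical helper) runs fn and converts any panic into an
// error that carries the recovered value and the goroutine stack trace.
func safeInvoke(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic recovered: %v\n%s", r, debug.Stack())
		}
	}()
	return fn()
}

func main() {
	bits := make(bitSet, 1) // one 64-bit block: valid positions are 0..63

	err := safeInvoke(func() error {
		_ = bits.isSet(64) // block index 1 does not exist, so this panics
		return nil
	})
	fmt.Println("recovered an error:", err != nil)
}
```

Recovering at the plugin boundary keeps one failing task from crashing the whole controller process, but a panic that is deterministic for a given input will recur on every retry.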

Dan Rammer (hamersaw)

08/04/2022, 3:49 PM
@Kevin Su aws-batch or k8s-array plugin? This doesn't seem related to Mike's issue, but we should still resolve it. Create an issue?

Mike Zhong

08/04/2022, 4:31 PM
Hi @Kevin Su. Here is the log for one of the mapped tasks. Unfortunately it’s not particularly helpful: we didn’t have the logger set, so we don’t have an indication of where in our task it failed, but we suspect it failed after it completed. I’m not sure what "panic when reconciling workflow" means, but if you point me to what could throw that error, I could dig a little more
that error you see is handled, it’s more of a warning that we are adding handlers to a non-root logger

Kevin Su

08/05/2022, 11:03 AM
I just created a PR to fix it. The problem is that tCtx.ow.GetReader() is nil when running map tasks with no output, which causes a nil pointer dereference panic. https://github.com/flyteorg/flytepropeller/pull/465
👀 1
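As a rough illustration of the kind of guard described above, the Go sketch below checks a reader for nil before calling a method on it. Every type and method name here (outputWriter, outputReader, DeckExists, deckExists) is a hypothetical stand-in chosen for the example; none of it is taken from the FlytePropeller source or from the linked PR.

```go
package main

import (
	"errors"
	"fmt"
)

// outputReader and outputWriter are hypothetical interfaces for this sketch;
// they are NOT the real FlytePropeller or flyteplugins types.
type outputReader interface {
	DeckExists() (bool, error)
}

type outputWriter interface {
	// GetReader may return nil when the task produced no output.
	GetReader() outputReader
}

// noOutputWriter models a task with no output: its reader is nil.
type noOutputWriter struct{}

func (noOutputWriter) GetReader() outputReader { return nil }

// deckReader and deckWriter model a task that did write output.
type deckReader struct{}

func (deckReader) DeckExists() (bool, error) { return true, nil }

type deckWriter struct{}

func (deckWriter) GetReader() outputReader { return deckReader{} }

// deckExists guards against a nil reader instead of dereferencing it.
func deckExists(ow outputWriter) (bool, error) {
	if ow == nil {
		return false, errors.New("output writer is nil")
	}
	reader := ow.GetReader()
	if reader == nil {
		// No output was written, so there is nothing to check.
		return false, nil
	}
	return reader.DeckExists()
}

func main() {
	fmt.Println(deckExists(noOutputWriter{}))
	fmt.Println(deckExists(deckWriter{}))
}
```

Treating a missing reader as "nothing exists" rather than as an error is a design choice made for this sketch; the actual fix may handle the case differently.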

Mike Zhong

08/05/2022, 3:10 PM
Interesting root cause. I looked at our mapped-out task and we do return an int. It’s not captured or used though. We ran into an issue where trying to use .with_overrides() with a map_task(task), where task returns nothing, fails to compile with "VoidPromise has no attribute with_overrides". We suspected there was something different between VoidPromise and Promise, so we made sure all our map tasks returned something, even if it is just a sentinel value. I’d like to see if this fix resolves our issue
👍 1

Dan Rammer (hamersaw)

08/05/2022, 6:16 PM
@Mike Zhong it sounds like there may be another bug there with flytekit construction of map tasks. I'm not sure if with_overrides is currently supported, but it probably should be. @Kevin Su thanks for the backend fix, do you know anything about the flytekit side?

Kevin Su

08/05/2022, 8:03 PM
@Mike Zhong I just created a PR to support overriding the resources of a VoidPromise. https://github.com/flyteorg/flytekit/pull/1127
👀 1