Mike Zhong

    1 month ago
    Hello Flyte team, we are encountering the following error when running one of our larger map tasks. We did not encounter this error on previous map tasks, but it now reproduces consistently (the last 2 runs). Any thoughts on the cause?
    Eventually it hits the system retry limit of 50 and crashes.
    Samhita Alla

    1 month ago
    cc: @Dan Rammer (hamersaw)
    Dan Rammer (hamersaw)

    1 month ago
    @Mike Zhong it sounds like this is only happening on one larger map task and no others? Did you recently upgrade any components?
    /go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:487
    This line has to do with checking for the existence of the newer Flyte Deck stuff, but it is peculiar that this would only fail in a single instance.
    Mike Zhong

    1 month ago
    we have not recently updated any components. This particular map task, in our test setting, fanned out 500 tasks
    but we have other map tasks in our “pipeline” which fan out to a greater or similar degree
    those did not encounter this error
    we are in the process of adding additional logging, and enabling cache (most of our other tasks are cache enabled)
    Dan Rammer (hamersaw)

    1 month ago
    Sure, what version of FlytePropeller do you have running?
    Mike Zhong

    1 month ago
    looks like v1.1.0
    Dan Rammer (hamersaw)

    1 month ago
    cc @Kevin Su looks like this is the line that panics, thoughts?
    it seems like there may be a missing nil check in there somewhere.
    Kevin Su

    1 month ago
    @Mike Zhong Do you have any other logs (for example, the log of the map task)? I tried running a map task that fans out to 1000 tasks, but didn't get any error.
    Update: I ran a larger map task that fans out to 10000 tasks and got the error below. I used the same example as above but changed the input to 10000. It panics at this line. cc @Dan Rammer (hamersaw)
    /usr/local/go/src/runtime/debug/stack.go:24 +0x65
    github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1.1()
    	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:375 +0xfe
    panic({0x1f45600, 0x3959500})
    	/usr/local/go/src/runtime/panic.go:838 +0x207
    github.com/flyteorg/flytestdlib/bitarray.(*BitSet).IsSet(...)
    	/go/pkg/mod/github.com/flyteorg/flytestdlib@v1.0.4/bitarray/bitset.go:33
    github.com/flyteorg/flyteplugins/go/tasks/plugins/array/core.InitializeExternalResources({0x279ca70, 0xc006688a50}, {0x27a8360?, 0xc00154d760?}, 0xc005fd3440, 0x23d7cd8)
    	/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.5/go/tasks/plugins/array/core/metadata.go:33 +0x1e1
    github.com/flyteorg/flyteplugins/go/tasks/plugins/array/k8s.Executor.Handle({{0x7f33215a0ff0, 0xc000a7e380}, {{0x278fa10, 0xc00174a0b0}}, {{0x278fa10, 0xc00174a160}}}, {0x279ca70, 0xc006688a50}, {0x27a8360, 0xc00154d760})
    	/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.5/go/tasks/plugins/array/k8s/executor.go:94 +0x225
    github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1(0x0?, {0x279ca70, 0xc006688810}, {0x279ef58?, 0xc000bfe240?}, 0x0?)
    	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:382 +0x178
    github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin({{0x279cdb8, 0xc000a5d8d8}, {0x278a5a8, 0xc0009b4aa0}, 0xc0009c7260, 0xc0009c7290, 0xc0009c72c0, {0x279e4d8, 0xc0007f8000}, 0xc0009fc000, ...}, ...)
    	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:384 +0x9a
    github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.Handle({{0x279cdb8, 0xc000a5d8d8}, {0x278a5a8, 0xc0009b4aa0}, 0xc0009c7260, 0xc0009c7290, 0xc0009c72c0, {0x279e4d8, 0xc0007f8000}, 0xc0009fc000, ...}, ...)
    	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:616 +0x182b
    github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.handleParentNode({{0x279fe08, 0xc0009f00d0}, {{0xc000c86160, {{...}, 0x0}, {0xc000484080, 0x4, 0x4}}, {0xc000c86180, {{...}, ...}, ...}, ...}, ...}, ...)
    	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:70 +0xd8
    github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.Handle({{0x279fe08, 0xc0009f00d0}, {{0xc000c86160, {{...}, 0x0}, {0xc000484080, 0x4, 0x4}}, {0xc000c86180, {{...}, ...}, ...}, ...}, ...}, ...)
    	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:220 +0x9d0
    github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc0009e0000, {0x279ca70, 0xc006688330}, {0x279e358, 0xc000734000}, 0xc00808dec0, {0x27b0d58?, 0xc008983110?})
    	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:382 +0x157
    github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc0009e0000, {0x279ca70, 0xc006688330}, 0xc00808dec0, {0x279e358?, 0xc000734000?})
    	/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:512 +0x227
    github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc0009e0000, {0x279ca70, 0xc006688330}, {0x2782f10, 0xc005f9cf00}, 0xc00808dec0, {0x279e358?, 0xc000734000})
    Dan Rammer (hamersaw)

    1 month ago
    @Kevin Su aws-batch or k8s-array plugin? This doesn't seem related to Mike's issue, but we should still resolve it. Create an issue?
    Mike Zhong

    1 month ago
    Hi @Kevin Su. Here is the log for one of the mapped tasks. Unfortunately it's not particularly helpful: we didn't have the logger set, so we don't have an indication of where in our task it failed, but we suspect it failed after it completed. I'm not sure what `panic when reconciling workflow` means, but if you point me to what could throw that error, I could dig a little more.
    that error you see is handled, it’s more of a warning that we are adding handlers to a non-root logger
    Kevin Su

    1 month ago
    I just created a PR to fix it. The problem is that `tCtx.ow.GetReader()` is nil when running map tasks with no output, which causes a nil pointer dereference panic. https://github.com/flyteorg/flytepropeller/pull/465
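
For readers following along, here is a minimal sketch of the kind of nil guard described above. It is not the code from the PR: the `OutputReader` and `OutputWriter` interfaces, the `DeckExists` method, and the `deckExists` helper are hypothetical stand-ins, assuming only that the reader can be nil when a task produces no output.

```go
package main

import "fmt"

// OutputReader is a hypothetical stand-in for the reader that propeller
// consults; it is not the real flytepropeller interface.
type OutputReader interface {
	DeckExists() (bool, error)
}

// OutputWriter is a hypothetical stand-in for tCtx.ow in the discussion.
type OutputWriter interface {
	GetReader() OutputReader
}

// fileReader is a toy reader that reports whether a deck was written.
type fileReader struct{ hasDeck bool }

func (f fileReader) DeckExists() (bool, error) { return f.hasDeck, nil }

// writer returns whatever reader it holds, which may be nil.
type writer struct{ reader OutputReader }

func (w writer) GetReader() OutputReader { return w.reader }

// deckExists checks the reader for nil before calling a method on it.
// Without the guard, calling DeckExists on a nil interface value panics.
func deckExists(ow OutputWriter) (bool, error) {
	r := ow.GetReader()
	if r == nil {
		// Assumption: no reader means the task wrote no outputs, so no deck.
		return false, nil
	}
	return r.DeckExists()
}

func main() {
	// A task with no output: the reader is nil.
	ok, err := deckExists(writer{})
	fmt.Println(ok, err)

	// A task that wrote a deck.
	ok, err = deckExists(writer{reader: fileReader{hasDeck: true}})
	fmt.Println(ok, err)
}
```

The guard treats a missing reader as "no deck" rather than an error, which is one possible design choice; the actual fix may handle this case differently.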
    Mike Zhong

    1 month ago
    Interesting root cause. I looked at our mapped-out task and we do return an `int`. It's not captured or used, though. We ran into an issue where trying to use `.with_overrides()` with a `map_task(task)` where `task` returns nothing fails to compile with `VoidPromise has no attribute with_overrides`. We suspected there was something different between `VoidPromise` and `Promise`, so we made sure all our map tasks returned something, even if it is just a sentinel value. I'd like to see if this fix resolves our issue.
    Dan Rammer (hamersaw)

    1 month ago
    @Mike Zhong it sounds like there may be another bug there with flytekit construction of map tasks. I'm not sure if `with_overrides` is currently supported, but it probably should be. @Kevin Su thanks for the backend fix, do you know anything about the flytekit side?
    Kevin Su

    1 month ago
    @Mike Zhong I just created a PR to support overriding the resources of a `VoidPromise`. https://github.com/flyteorg/flytekit/pull/1127