varsha Parthasarathy
11/15/2022, 3:21 PM
@flytekit.task(
    limits=flytekit.Resources(cpu="1", gpu="1", mem="70Gi"),
    requests=flytekit.Resources(cpu="1", gpu="1", mem="70Gi"),
    retries=1,
    task_config=Pod(
        pod_spec=V1PodSpec(
            containers=[
                V1Container(name="primary"),
            ],
            node_selector={"l5.lyft.com/pool": "eks-pdx-pool-gpu"},
            tolerations=[
                V1Toleration(effect="NoSchedule", key="reserved", operator="Equal", value="gpu"),
            ],
        ),
        primary_container_name="primary",
    ),
)
the above works as expected..
@flytekit.task(
    cache=True,
    cache_version="0.0.0",
    limits=flytekit.Resources(cpu="1", gpu="1", mem="70Gi"),
    requests=flytekit.Resources(cpu="1", gpu="1", mem="70Gi"),
    retries=1,
    task_config=Pod(
        pod_spec=V1PodSpec(
            containers=[
                V1Container(name="primary"),
            ],
            node_selector={"l5.lyft.com/pool": "eks-pdx-pool-gpu"},
            tolerations=[
                V1Toleration(effect="NoSchedule", key="reserved", operator="Equal", value="gpu"),
            ],
        ),
        primary_container_name="primary",
    ),
)
Workflow[avperceptionworkflows:dev:src.perception.scene_workflows.ground_truth.ground_truth_offline_workflows.GroundTruthOfflinePCPWorkFlow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]: panic when executing a plugin [k8s-array]. Stack: [goroutine 982 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1.1()
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:375 +0xfe
panic({0x1f45600, 0x395a540})
/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/flyteorg/flytestdlib/bitarray.(*BitSet).IsSet(...)
/go/pkg/mod/github.com/flyteorg/flytestdlib@v1.0.4/bitarray/bitset.go:33
github.com/flyteorg/flyteplugins/go/tasks/plugins/array/core.InitializeExternalResources({0x279cfd0, 0xc07cb88f60}, {0x27a88c0?, 0xc0887b6420?}, 0xc0d58fafc0, 0x23d8110)
/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.8/go/tasks/plugins/array/core/metadata.go:33 +0x1e1
github.com/flyteorg/flyteplugins/go/tasks/plugins/array/k8s.Executor.Handle({{0x7f1fa798ba18, 0xc000b34a80}, {{0x278ff70, 0xc00054a8f0}}, {{0x278ff70, 0xc00054aa50}}}, {0x279cfd0, 0xc07cb88f60}, {0x27a88c0, 0xc0887b6420})
/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.8/go/tasks/plugins/array/k8s/executor.go:94 +0x225
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1(0x0?, {0x279cfd0, 0xc07cb88d20}, {0x279f4b8?, 0xc000d17230?}, 0x0?)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:382 +0x178
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin({{0x279d318, 0xc00180e648}, {0x278ab08, 0xc000baea40}, 0xc000e76fc0, 0xc000e77050, 0xc000e77080, {0x279ea38, 0xc044cc7cc0}, 0xc000bf2580, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:384 +0x9a
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.Handle({{0x279d318, 0xc00180e648}, {0x278ab08, 0xc000baea40}, 0xc000e76fc0, 0xc000e77050, 0xc000e77080, {0x279ea38, 0xc044cc7cc0}, 0xc000bf2580, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:616 +0x182b
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.handleParentNode({{0x27a0368, 0xc001d6c1a0}, {{0xc000d1fbe0, {{...}, 0x0}, {0xc0012460c0, 0x4, 0x4}}, {0xc000d1fc00, {{...}, ...}, ...}, ...}, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:70 +0xd8
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.Handle({{0x27a0368, 0xc001d6c1a0}, {{0xc000d1fbe0, {{...}, 0x0}, {0xc0012460c0, 0x4, 0x4}}, {0xc000d1fc00, {{...}, ...}, ...}, ...}, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:220 +0x9d0
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc001aaa540, {0x279cfd0, 0xc07cb88690}, {0x279e8b8, 0xc000c2c000}, 0xc063448d80, {0x27b12b8?, 0xc0641ae0d0?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:382 +0x157
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc001aaa540, {0x279cfd0, 0xc07cb88690}, 0xc063448d80, {0x279e8b8?, 0xc000c2c000?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:512 +0x227
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc001aaa540, {0x279cfd0, 0xc07cb88690}, {0x2783470, 0xc0999a5400}, 0xc063448d80, {0x279e8b8?, 0xc000c2c000})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:736 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc001aaa540, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400}, {0x2783498?, 0xc0999a5400?}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:934 +0x705
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleDownstream(0x22fdfc1?, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400?}, {0x2783498?, 0xc0999a5400}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:774 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc001aaa540, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400}, {0x2783498?, 0xc0999a5400?}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:941 +0x935
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleDownstream(0x22fdfc1?, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400?}, {0x2783498?, 0xc0999a5400}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:774 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc001aaa540, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400}, {0x2783498?, 0xc0999a5400?}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:941 +0x935
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).handleRunningWorkflow(0xc00078e700, {0x279cfd0, 0xc07cb88240}, 0xc0999a5400)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:147 +0x1b3
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).HandleFlyteWorkflow(0xc00078e700, {0x279cfd0, 0xc07cb88240}, 0xc0999a5400)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:393 +0x40f
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow.func2(0xc000f6ba10, {0x279cfd0, 0xc07cb88240}, 0xc0a4151848, 0x1e5a080?)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:130 +0x18e
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow(0xc000f6ba10, {0x279cfd0, 0xc0954dbc50}, 0xc0999a4a00)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:131 +0x459
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).Handle(0xc000f6ba10, {0x279cfd0, 0xc0954dbc50}, {0xc0b25c0318, 0x3}, {0xc0b25c031c, 0x14})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:205 +0x86d
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem.func1(0xc001528cf0, 0xc0a4151f28, {0x1e5a080?, 0xc08a8d8bd0})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:88 +0x510
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem(0xc001528cf0, {0x279cfd0, 0xc0954dbc50})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:99 +0xf1
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).runWorker(0x279cfd0?, {0x279cfd0, 0xc049036c00})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:115 +0xbd
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run.func1()
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:150 +0x59
created by github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run
/go/src/github.co
Dan Rammer (hamersaw)
11/15/2022, 3:24 PM
varsha Parthasarathy
11/15/2022, 3:27 PM
Dan Rammer (hamersaw)
11/15/2022, 3:27 PM
Ketan (kumare3)
varsha Parthasarathy
11/15/2022, 3:28 PM
> did this failure occur on the first execution or when using cached data?
when using cache data.
> how large is the maptask fanout (ie. how many input items)?
6 subtasks for map task
Dan Rammer (hamersaw)
11/15/2022, 3:29 PM
varsha Parthasarathy
11/15/2022, 3:29 PM
Ketan (kumare3)
varsha Parthasarathy
11/15/2022, 3:31 PM
Ketan (kumare3)
varsha Parthasarathy
11/15/2022, 3:44 PM
Dan Rammer (hamersaw)
11/15/2022, 3:46 PM
varsha Parthasarathy
11/15/2022, 3:47 PM
Dan Rammer (hamersaw)
11/15/2022, 6:11 PM
varsha Parthasarathy
11/15/2022, 6:18 PM
datacatalog_version = "v1.0.1"
flyteadmin_version = "v1.1.21"
flyteconsole_version = "v1.1.0"
flytecopilot_version = "v0.0.26"
flytepropeller_version = "v1.1.21"
Dan Rammer (hamersaw)
11/15/2022, 6:26 PM
Alex Pozimenko
11/17/2022, 11:05 PM
Ketan (kumare3)
Dan Rammer (hamersaw)
11/18/2022, 12:21 AM
Alex Pozimenko
11/18/2022, 12:25 AM
Dan Rammer (hamersaw)
11/18/2022, 12:32 AM
Ketan (kumare3)
Alex Pozimenko
11/22/2022, 10:36 PM
Dan Rammer (hamersaw)
11/23/2022, 5:18 PM
Alex Pozimenko
11/23/2022, 5:59 PM
Dan Rammer (hamersaw)
11/23/2022, 9:51 PM
> flyte propeller free workers count was at 0
do you know how many workers you have propeller configured with? and then the round latency is pretty important here i think. were you able to find that prometheus metric?
Alex Pozimenko
11/23/2022, 11:56 PM
> do you know how many workers you have propeller configured with?
64. what is the round latency?
Dan Rammer (hamersaw)
11/24/2022, 12:18 AM
Alex Pozimenko
11/24/2022, 12:20 AM
Dan Rammer (hamersaw)
11/24/2022, 12:21 AM
sum(rate(flyte:propeller:all:round:raw_ms[5m])) by (wf)
> do I need to bump cpu/mem as well? right now we have 8CPU / 16G memory limit (1 / 1G request)
that's difficult to say. right now, depending on configuration, propeller can use quite a bit of memory for things like caching workflow definitions, blobstore caching, etc. however, we aren't seeing issues with cpu utilization being high. thankfully increasing the number of workers has no effect (or very little) on memory utilization, so if 16G has been fine up until now, i would say you're fine. i think 8 cpu would be plenty for 256 workers, but it might be something to keep an eye on.
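For anyone following along, here is a minimal sketch of turning the round-latency query quoted above into a request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The server address `prometheus.example:9090` and the helper name `instant_query_url` are placeholders for illustration, not anything mentioned in this thread.

```python
from urllib.parse import urlencode

# PromQL copied verbatim from the message above.
ROUND_LATENCY_QUERY = "sum(rate(flyte:propeller:all:round:raw_ms[5m])) by (wf)"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL (hypothetical helper)."""
    query_string = urlencode({"query": promql})  # percent-encodes the PromQL
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Assumed address; replace with your own Prometheus server.
url = instant_query_url("http://prometheus.example:9090", ROUND_LATENCY_QUERY)
print(url)
```

Fetching that URL with any HTTP client should return the per-workflow series as JSON, assuming the metric is exported under that name in your deployment.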
Alex Pozimenko
11/24/2022, 12:24 AM
Dan Rammer (hamersaw)
11/24/2022, 12:26 AM
> i have propeller workflow acceptance latency and transition latency. These two should correlate with the round latency, right?
also a difficult question 😅. theoretically, if the round latency is high then workers will take longer to process each workflow - so the acceptance latency and transition latencies will be high as well. however, this is only the case if the number of free workers hits zero. if you have workers available, then the round latency can be higher and the acceptance and transition latencies should not be significantly affected.
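The point about free workers can be made concrete with a rough back-of-the-envelope model. This is a deliberately simplified steady-state sketch, not how propeller schedules work internally. It assumes each worker handles one workflow round at a time, so the average number of busy workers is roughly the round arrival rate multiplied by the round latency. The worker counts 64 and 256 come from the thread. The 2-second round latency and 40 rounds per second are made-up numbers, and both function names are hypothetical.

```python
def busy_workers(workers: int, round_latency_s: float, rounds_per_s: float) -> float:
    """Average number of occupied workers, capped at the pool size."""
    demand = rounds_per_s * round_latency_s  # arrival rate x time per round
    return min(float(workers), demand)


def is_saturated(workers: int, round_latency_s: float, rounds_per_s: float) -> bool:
    """True when demand meets or exceeds the pool, i.e. free workers reach zero."""
    return rounds_per_s * round_latency_s >= workers


# Hypothetical load: 40 rounds/s arriving, each taking 2 s to evaluate.
for pool in (64, 256):
    print(pool, busy_workers(pool, 2.0, 40.0), is_saturated(pool, 2.0, 40.0))
```

Under these assumed numbers the 64-worker pool has no free workers, so rounds queue up and acceptance and transition latencies would be expected to climb. The 256-worker pool absorbs the same round latency with workers to spare.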
Alex Pozimenko
11/24/2022, 3:22 AM
Ketan (kumare3)
Alex Pozimenko
11/24/2022, 11:28 PM
Ketan (kumare3)
Alex Pozimenko
11/24/2022, 11:30 PM
resync-interval and/or controller-threads may help
Ketan (kumare3)