sticky-art-97180
11/15/2022, 3:21 PM
sticky-art-97180
11/15/2022, 3:21 PM
@flytekit.task(
    limits=flytekit.Resources(cpu="1", gpu="1", mem="70Gi"),
    requests=flytekit.Resources(cpu="1", gpu="1", mem="70Gi"),
    retries=1,
    task_config=Pod(
        pod_spec=V1PodSpec(
            containers=[
                V1Container(name="primary"),
            ],
            node_selector={"l5.lyft.com/pool": "eks-pdx-pool-gpu"},
            tolerations=[
                V1Toleration(effect="NoSchedule", key="reserved", operator="Equal", value="gpu"),
            ],
        ),
        primary_container_name="primary",
    ),
)
the above works as expected..
sticky-art-97180
11/15/2022, 3:22 PM
@flytekit.task(
    cache=True,
    cache_version="0.0.0",
    limits=flytekit.Resources(cpu="1", gpu="1", mem="70Gi"),
    requests=flytekit.Resources(cpu="1", gpu="1", mem="70Gi"),
    retries=1,
    task_config=Pod(
        pod_spec=V1PodSpec(
            containers=[
                V1Container(name="primary"),
            ],
            node_selector={"l5.lyft.com/pool": "eks-pdx-pool-gpu"},
            tolerations=[
                V1Toleration(effect="NoSchedule", key="reserved", operator="Equal", value="gpu"),
            ],
        ),
        primary_container_name="primary",
    ),
)
sticky-art-97180
11/15/2022, 3:23 PM
Workflow[avperceptionworkflows:dev:src.perception.scene_workflows.ground_truth.ground_truth_offline_workflows.GroundTruthOfflinePCPWorkFlow] failed. RuntimeExecutionError: max number of system retry attempts [51/50] exhausted. Last known status message: failed at Node[n1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [k8s-array]: panic when executing a plugin [k8s-array]. Stack: [goroutine 982 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1.1()
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:375 +0xfe
panic({0x1f45600, 0x395a540})
/usr/local/go/src/runtime/panic.go:838 +0x207
github.com/flyteorg/flytestdlib/bitarray.(*BitSet).IsSet(...)
/go/pkg/mod/github.com/flyteorg/flytestdlib@v1.0.4/bitarray/bitset.go:33
github.com/flyteorg/flyteplugins/go/tasks/plugins/array/core.InitializeExternalResources({0x279cfd0, 0xc07cb88f60}, {0x27a88c0?, 0xc0887b6420?}, 0xc0d58fafc0, 0x23d8110)
/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.8/go/tasks/plugins/array/core/metadata.go:33 +0x1e1
github.com/flyteorg/flyteplugins/go/tasks/plugins/array/k8s.Executor.Handle({{0x7f1fa798ba18, 0xc000b34a80}, {{0x278ff70, 0xc00054a8f0}}, {{0x278ff70, 0xc00054aa50}}}, {0x279cfd0, 0xc07cb88f60}, {0x27a88c0, 0xc0887b6420})
/go/pkg/mod/github.com/flyteorg/flyteplugins@v1.0.8/go/tasks/plugins/array/k8s/executor.go:94 +0x225
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1(0x0?, {0x279cfd0, 0xc07cb88d20}, {0x279f4b8?, 0xc000d17230?}, 0x0?)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:382 +0x178
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin({{0x279d318, 0xc00180e648}, {0x278ab08, 0xc000baea40}, 0xc000e76fc0, 0xc000e77050, 0xc000e77080, {0x279ea38, 0xc044cc7cc0}, 0xc000bf2580, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:384 +0x9a
github.com/flyteorg/flytepropeller/pkg/controller/nodes/task.Handler.Handle({{0x279d318, 0xc00180e648}, {0x278ab08, 0xc000baea40}, 0xc000e76fc0, 0xc000e77050, 0xc000e77080, {0x279ea38, 0xc044cc7cc0}, 0xc000bf2580, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/task/handler.go:616 +0x182b
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.handleParentNode({{0x27a0368, 0xc001d6c1a0}, {{0xc000d1fbe0, {{...}, 0x0}, {0xc0012460c0, 0x4, 0x4}}, {0xc000d1fc00, {{...}, ...}, ...}, ...}, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:70 +0xd8
github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.Handle({{0x27a0368, 0xc001d6c1a0}, {{0xc000d1fbe0, {{...}, 0x0}, {0xc0012460c0, 0x4, 0x4}}, {0xc000d1fc00, {{...}, ...}, ...}, ...}, ...}, ...)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/dynamic/handler.go:220 +0x9d0
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0xc001aaa540, {0x279cfd0, 0xc07cb88690}, {0x279e8b8, 0xc000c2c000}, 0xc063448d80, {0x27b12b8?, 0xc0641ae0d0?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:382 +0x157
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0xc001aaa540, {0x279cfd0, 0xc07cb88690}, 0xc063448d80, {0x279e8b8?, 0xc000c2c000?})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:512 +0x227
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleNode(0xc001aaa540, {0x279cfd0, 0xc07cb88690}, {0x2783470, 0xc0999a5400}, 0xc063448d80, {0x279e8b8?, 0xc000c2c000})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:736 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc001aaa540, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400}, {0x2783498?, 0xc0999a5400?}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:934 +0x705
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleDownstream(0x22fdfc1?, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400?}, {0x2783498?, 0xc0999a5400}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:774 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc001aaa540, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400}, {0x2783498?, 0xc0999a5400?}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:941 +0x935
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleDownstream(0x22fdfc1?, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400?}, {0x2783498?, 0xc0999a5400}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:774 +0x3c5
github.com/flyteorg/flytepropeller/pkg/controller/nodes.(*nodeExecutor).RecursiveNodeHandler(0xc001aaa540, {0x279cfd0, 0xc07cb88240}, {0x27ac5a8, 0xc0a5113e50}, {0x2783470, 0xc0999a5400}, {0x2783498?, 0xc0999a5400?}, {0x27a99d0, ...})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/nodes/executor.go:941 +0x935
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).handleRunningWorkflow(0xc00078e700, {0x279cfd0, 0xc07cb88240}, 0xc0999a5400)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:147 +0x1b3
github.com/flyteorg/flytepropeller/pkg/controller/workflow.(*workflowExecutor).HandleFlyteWorkflow(0xc00078e700, {0x279cfd0, 0xc07cb88240}, 0xc0999a5400)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workflow/executor.go:393 +0x40f
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow.func2(0xc000f6ba10, {0x279cfd0, 0xc07cb88240}, 0xc0a4151848, 0x1e5a080?)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:130 +0x18e
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow(0xc000f6ba10, {0x279cfd0, 0xc0954dbc50}, 0xc0999a4a00)
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:131 +0x459
github.com/flyteorg/flytepropeller/pkg/controller.(*Propeller).Handle(0xc000f6ba10, {0x279cfd0, 0xc0954dbc50}, {0xc0b25c0318, 0x3}, {0xc0b25c031c, 0x14})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/handler.go:205 +0x86d
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem.func1(0xc001528cf0, 0xc0a4151f28, {0x1e5a080?, 0xc08a8d8bd0})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:88 +0x510
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem(0xc001528cf0, {0x279cfd0, 0xc0954dbc50})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:99 +0xf1
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).runWorker(0x279cfd0?, {0x279cfd0, 0xc049036c00})
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:115 +0xbd
github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run.func1()
/go/src/github.com/flyteorg/flytepropeller/pkg/controller/workers.go:150 +0x59
created by github.com/flyteorg/flytepropeller/pkg/controller.(*WorkerPool).Run
/go/src/github.co
sticky-art-97180
11/15/2022, 3:23 PM
hallowed-mouse-14616
11/15/2022, 3:24 PM
sticky-art-97180
11/15/2022, 3:27 PM
hallowed-mouse-14616
11/15/2022, 3:27 PM
hallowed-mouse-14616
11/15/2022, 3:27 PM
freezing-airport-6809
freezing-airport-6809
sticky-art-97180
11/15/2022, 3:28 PM
"did this failure occur on the first execution or when using cached data?"
when using cached data.
"how large is the maptask fanout (ie. how many input items)?"
6 subtasks for map task
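A note on the trace above: the panic surfaces in bitarray.(*BitSet).IsSet, called from InitializeExternalResources in the k8s-array plugin, and it is reported only on a cached run of a 6-subtask map task. The thread does not state the root cause. Purely as an illustration, and assuming a block-backed bitset that ends up smaller than the index being queried, the sketch below shows how an unchecked lookup fails while a bounds-checked one does not. The class and method names are hypothetical and written in Python; this is not the flytestdlib implementation.

```python
class BitSet:
    """Toy bitset backed by 32-bit blocks (hypothetical, for illustration only)."""

    BLOCK_SIZE = 32

    def __init__(self, capacity: int) -> None:
        # Allocate just enough blocks to hold `capacity` bits (zero blocks if capacity is 0).
        n_blocks = (capacity + self.BLOCK_SIZE - 1) // self.BLOCK_SIZE
        self.blocks = [0] * n_blocks

    def set(self, index: int) -> None:
        self.blocks[index // self.BLOCK_SIZE] |= 1 << (index % self.BLOCK_SIZE)

    def is_set_unchecked(self, index: int) -> bool:
        # No bounds check: raises IndexError when the block does not exist.
        block = self.blocks[index // self.BLOCK_SIZE]
        return (block & (1 << (index % self.BLOCK_SIZE))) != 0

    def is_set(self, index: int) -> bool:
        # Bounds-checked variant: an out-of-range index is simply "not set".
        block_index = index // self.BLOCK_SIZE
        if block_index >= len(self.blocks):
            return False
        return (self.blocks[block_index] & (1 << (index % self.BLOCK_SIZE))) != 0


if __name__ == "__main__":
    # Assumption: a bitset that was never sized for the 6 subtasks.
    empty = BitSet(0)
    for subtask in range(6):
        print(subtask, empty.is_set(subtask))
    try:
        empty.is_set_unchecked(0)
    except IndexError as exc:
        print("unchecked lookup failed:", exc)
```

Whether this matches what happened here would need to be confirmed against the flytestdlib and flyteplugins versions named in the trace.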
sticky-art-97180
11/15/2022, 3:29 PM
hallowed-mouse-14616
11/15/2022, 3:29 PM
sticky-art-97180
11/15/2022, 3:29 PM
freezing-airport-6809
freezing-airport-6809
sticky-art-97180
11/15/2022, 3:31 PM
freezing-airport-6809
freezing-airport-6809
sticky-art-97180
11/15/2022, 3:44 PM
sticky-art-97180
11/15/2022, 3:44 PM
hallowed-mouse-14616
11/15/2022, 3:46 PM
hallowed-mouse-14616
11/15/2022, 3:46 PM
hallowed-mouse-14616
11/15/2022, 3:47 PM
sticky-art-97180
11/15/2022, 3:47 PM
hallowed-mouse-14616
11/15/2022, 6:11 PM
hallowed-mouse-14616
11/15/2022, 6:15 PM
sticky-art-97180
11/15/2022, 6:18 PM
datacatalog_version = "v1.0.1"
flyteadmin_version = "v1.1.21"
flyteconsole_version = "v1.1.0"
flytecopilot_version = "v0.0.26"
flytepropeller_version = "v1.1.21"
sticky-art-97180
11/15/2022, 6:19 PM
hallowed-mouse-14616
11/15/2022, 6:26 PM
hallowed-mouse-14616
11/15/2022, 6:48 PM
orange-hairdresser-63684
11/17/2022, 11:05 PM
orange-hairdresser-63684
11/17/2022, 11:06 PM
orange-hairdresser-63684
11/18/2022, 12:13 AM
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
hallowed-mouse-14616
11/18/2022, 12:21 AM
orange-hairdresser-63684
11/18/2022, 12:25 AM
orange-hairdresser-63684
11/18/2022, 12:25 AM
orange-hairdresser-63684
11/18/2022, 12:25 AM
orange-hairdresser-63684
11/18/2022, 12:31 AM
hallowed-mouse-14616
11/18/2022, 12:32 AM
freezing-airport-6809
freezing-airport-6809
orange-hairdresser-63684
11/22/2022, 10:36 PM
orange-hairdresser-63684
11/22/2022, 10:57 PM
orange-hairdresser-63684
11/22/2022, 10:57 PM
orange-hairdresser-63684
11/22/2022, 11:14 PM
hallowed-mouse-14616
11/23/2022, 5:18 PM
hallowed-mouse-14616
11/23/2022, 5:19 PM
orange-hairdresser-63684
11/23/2022, 5:59 PM
orange-hairdresser-63684
11/23/2022, 6:00 PM
orange-hairdresser-63684
11/23/2022, 6:01 PM
hallowed-mouse-14616
11/23/2022, 9:51 PM
"flyte propeller free workers count was at 0"
do you know how many workers you have propeller configured with? and then the round latency is pretty important here i think. were you able to find that prometheus metric?
orange-hairdresser-63684
11/23/2022, 11:56 PM
"do you know how many workers you have propeller configured with?"
64. what is the round latency?
hallowed-mouse-14616
11/24/2022, 12:18 AM
orange-hairdresser-63684
11/24/2022, 12:20 AM
hallowed-mouse-14616
11/24/2022, 12:21 AM
sum(rate(flyte:propeller:all:round:raw_ms[5m])) by (wf)
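For anyone who wants to run that query outside the Prometheus UI, here is a minimal sketch that builds the request URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL `http://prometheus.example:9090` and the helper names are assumptions for illustration; only the PromQL expression comes from the message above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The round-latency query quoted in the thread.
ROUND_LATENCY_QUERY = "sum(rate(flyte:propeller:all:round:raw_ms[5m])) by (wf)"


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return a URL-encoded instant-query URL for a Prometheus server."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def query_from_url(url: str) -> str:
    """Decode the PromQL expression back out of a query URL (round-trip check)."""
    return parse_qs(urlsplit(url).query)["query"][0]


if __name__ == "__main__":
    # Hypothetical Prometheus address; replace with your own.
    url = build_instant_query_url("http://prometheus.example:9090", ROUND_LATENCY_QUERY)
    print(url)
    print(query_from_url(url) == ROUND_LATENCY_QUERY)
```

The resulting URL can be fetched with any HTTP client; the encoding step matters because the expression contains spaces, brackets, and colons.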
hallowed-mouse-14616
11/24/2022, 12:24 AM
"do I need to bump cpu/mem as well? right now we have 8CPU / 16G memory limit (1 / 1G request)"
that's difficult to say. right now, depending on configuration, propeller can use quite a bit of memory for things like caching workflow definitions, blobstore caching, etc. however, we aren't seeing issues with cpu utilization being high. thankfully increasing the number of workers has no effect (or very little) on memory utilization, so if 16G has been fine up until now, i would say you're fine. i think 8 cpu would be plenty for 256 workers, but it might be something to keep an eye on.
orange-hairdresser-63684
11/24/2022, 12:24 AM
hallowed-mouse-14616
11/24/2022, 12:26 AM
"i have propeller workflow acceptance latency and transition latency. These two should correlate with the round latency, right?"
also a difficult question. theoretically, if the round latency is high then workers will take longer to process each workflow, so the acceptance and transition latencies will be high as well. however, this is only the case if the number of free workers hits zero. if you have workers available, then the round latency can be higher and the acceptance and transition latencies should not be significantly affected.
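The point about free workers can be made concrete with a toy queueing model. This is only a sketch under stated assumptions: workflows arrive at a fixed interval, each round occupies one worker for a fixed round latency, and waiting time stands in for acceptance/transition latency. The function name, the 10 ms arrival interval, and the 500 ms / 1000 ms round latencies are made up for illustration; the worker counts 64 and 256 are the ones mentioned in the thread. It does not model propeller's actual scheduler.

```python
import heapq


def mean_wait_ms(workers: int, round_latency_ms: float,
                 arrival_interval_ms: float, n_items: int = 1000) -> float:
    """Mean time an item waits for a free worker in a simple FIFO model."""
    # Min-heap of the times at which each worker next becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    total_wait = 0.0
    for i in range(n_items):
        arrival = i * arrival_interval_ms
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait only if no worker is free yet
        total_wait += start - arrival
        heapq.heappush(free_at, start + round_latency_ms)
    return total_wait / n_items


if __name__ == "__main__":
    # 64 workers, rounds short enough that some workers are always free.
    print(mean_wait_ms(workers=64, round_latency_ms=500, arrival_interval_ms=10))
    # Same workers, longer rounds: free workers hit zero and waiting builds up.
    print(mean_wait_ms(workers=64, round_latency_ms=1000, arrival_interval_ms=10))
    # More workers absorb the longer rounds again.
    print(mean_wait_ms(workers=256, round_latency_ms=1000, arrival_interval_ms=10))
```

In this model the waiting time stays at zero as long as round latency divided by arrival interval is below the worker count, which mirrors the statement that higher round latency only hurts once no workers are free.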
orange-hairdresser-63684
11/24/2022, 3:22 AM
freezing-airport-6809
freezing-airport-6809
orange-hairdresser-63684
11/24/2022, 11:28 PM
freezing-airport-6809
freezing-airport-6809
orange-hairdresser-63684
11/24/2022, 11:30 PM
orange-hairdresser-63684
11/24/2022, 11:31 PM
orange-hairdresser-63684
11/24/2022, 11:31 PM
resync-interval and/or controller-threads may help
orange-hairdresser-63684
11/24/2022, 11:32 PM
freezing-airport-6809