clean-glass-36808
05/30/2025, 11:18 PM
# explicit DAG chaining is needed since vizlog conversion and metrics computation
# are dependent on sim modality's file generation side effects
sim_result >> vizlog_path
sim_result >> metric_result
In rare cases we're seeing both `vizlog_path` and `metric_result`'s inputs seemingly break and diverge from the task template. When `vizlog_path` and `metric_result` eventually run, they hit `KeyError`s because their literals and task input types don't match. This also crashes the UI when their nodes are visited.
In the most recent case where this happened it looks like the workflow CRD informer cache went stale and errors are thrown when marking the `sim_result` node as successful. `vizlog_path` and `metric_result` run concurrently while this is happening.
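For context, the affected sub-graph is roughly shaped like the sketch below. This is a reconstruction from the node IDs in the logs and the inputs.pb dumps further down, not the actual simulation:production:flyte.single_sim code: the task names, signatures, and the do_vizlog gate are assumptions, and the real n1/n2 nodes are dynamic nodes which the sketch flattens into a plain conditional and tasks.

from flytekit import conditional, task, workflow


# Hypothetical sketch only: run_sim / convert_vizlog / compute_metrics and their
# signatures are assumptions, not the real workflow code.
@task
def run_sim(params: dict) -> str:
    return "sim_output_path"


@task
def convert_vizlog(params: dict, task: str) -> str:
    return "vizlog_path"


@task
def compute_metrics(params: dict) -> float:
    return 0.0


@workflow
def single_sim(params: dict, do_vizlog: bool = True):
    sim_result = run_sim(params=params)

    # Branch node: its condition variable is the ".do_vizlog" input that shows up
    # in the broken inputs.pb dump below.
    vizlog_path = (
        conditional("do_vizlog")
        .if_(do_vizlog.is_true())
        .then(convert_vizlog(params=params, task="convert_vizlog"))
        .else_()
        .fail("vizlog conversion disabled")
    )

    metric_result = compute_metrics(params=params)

    # same explicit chaining as the snippet above
    sim_result >> vizlog_path
    sim_result >> metric_result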
2025-05-29 21:18:43.179 {"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n0","ns":"simulation-production","res_ver":"378182484","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [Running] -> [Succeeded], (handler phase [Success])","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.643 {"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1/n1-n0","ns":"simulation-production","res_ver":"378182515","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Dynamic handler.Handle's called with phase 0.","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.696 {"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n2/n2-n0","ns":"simulation-production","res_ver":"378182515","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Dynamic handler.Handle's called with phase 0.","ts":"2025-05-30T04:18:43Z"}
...
2025-05-29 21:18:43.823 {"json":{"exec_id":"fqyaqtoaaz4s1a","ns":"simulation-production","routine":"worker-9073"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"fqyaqtoaaz4s1a\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.874 {"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n0","ns":"simulation-production","res_ver":"378182484","routine":"worker-902","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [Running] -> [Succeeded], (handler phase [Success])","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.874 {"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n0","ns":"simulation-production","res_ver":"378182484","routine":"worker-902","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Recording NodeEvent [node_id:\"n0\" execution_id:{project:\"simulation\" domain:\"production\" name:\"fqyaqtoaaz4s1a\"}] phase[SUCCEEDED]","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.874 {"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n0","ns":"simulation-production","res_ver":"378182484","routine":"worker-902","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Failed to record node event [id:{node_id:\"n0\" ... with err: AlreadyExists: Event Already Exists, caused by [event has already been sent]","ts":"2025-05-30T04:18:43Z"}
...
2025-05-29 21:21:02.976 {"json":{"exec_id":"fqyaqtoaaz4s1a","ns": ... KeyError: '.do_vizlog'\\n\\nMessage:\\n\\n KeyError: '.do_vizlog'\" kind:SYSTEM timestamp:{seconds:1748578849 nanos:175162315}]","ts":"2025-05-30T04:21:02Z"}
Related to https://github.com/flyteorg/flyte/issues/6441
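The KeyError itself is consistent with the inputs.pb dumps below: the literal map handed to the task contains a variable the task's interface doesn't declare, and resolving literal names against the declared input types blows up on the first foreign key. A rough sketch of the shape of that lookup (hedged, not the actual flytekit code):

# Hedged illustration, not flytekit internals: each literal name from inputs.pb
# is resolved against the task's declared input types, so a foreign ".do_vizlog"
# key raises KeyError.
declared_input_types = {"params": dict, "task": str}  # convert_vizlog's interface (assumed)
literals_in_inputs_pb = {".do_vizlog": True}          # what the broken inputs.pb holds

kwargs = {
    name: declared_input_types[name]  # KeyError: '.do_vizlog'
    for name in literals_in_inputs_pb
}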
Broken Case
❯ protoc --decode_raw < inputs.pb
1 {
  1: ".do_vizlog"
  2 {
    1 {
      1 {
        4: 1
      }
    }
  }
}
Working Case
❯ protoc --decode_raw < inputs.pb
1 {
  1: "params"
  2 {
    1 {
      7 {
        1 .....
        }
      }
    }
  }
}
1 {
  1: "task"
  2 {
    1 {
      1 {
        3: "convert_vizlog"
      }
    }
  }
}
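A slightly friendlier way to compare the two dumps is to parse inputs.pb as a LiteralMap and just look at the variable names (sketch below; the file path is a placeholder for whichever copy was pulled from the metadata store):

# Decode a downloaded inputs.pb into a LiteralMap and list its variable names.
from flyteidl.core import literals_pb2

lm = literals_pb2.LiteralMap()
with open("inputs.pb", "rb") as f:  # placeholder path
    lm.ParseFromString(f.read())

# Broken case prints ['.do_vizlog']; working case prints ['params', 'task'].
print(sorted(lm.literals.keys()))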
clean-glass-36808
05/31/2025, 12:54 AM
2025-05-29 21:18:44.144
{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1","ns":"simulation-production","res_ver":"378182509","routine":"worker-552","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [NotYetStarted] -> [Queued]","ts":"2025-05-30T04:18:44Z"}
I think the issue here is that the branch root node as well as the conditional node use the same S3 path for inputs.
When the branch node gets reprocessed, it's probably rerunning the input data setup and overwriting the data in S3.
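One way to sanity-check the overwrite theory (a sketch; the bucket name and key are assumptions based on the usual propeller metadata layout, not verified paths) is to look at how many times the shared inputs.pb object was written and when, and compare that against the Queued transitions below:

# Hedged sketch: bucket and key are placeholders / assumptions, not confirmed paths.
import boto3

s3 = boto3.client("s3")
bucket = "my-flyte-metadata-bucket"  # placeholder
key = "metadata/propeller/simulation-production-fqyaqtoaaz4s1a/n1/data/0/inputs.pb"  # assumed layout

# Last write wins: compare against the log timestamps below.
print(s3.head_object(Bucket=bucket, Key=key)["LastModified"])

# If the metadata bucket is versioned, every write to the same key is visible.
for v in s3.list_object_versions(Bucket=bucket, Prefix=key).get("Versions", []):
    print(v["VersionId"], v["LastModified"])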
clean-glass-36808
05/31/2025, 1:03 AM
2025-05-29 21:18:43.291
{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1","ns":"simulation-production","res_ver":"378182509","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [NotYetStarted] -> [Queued]","ts":"2025-05-30T04:18:43Z"}
Conditional node input written the first time:
2025-05-29 21:18:43.541
{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1/n1-n0","ns":"simulation-production","res_ver":"378182512","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [NotYetStarted] -> [Queued]","ts":"2025-05-30T04:18:43Z"}
Root node input written the second time:
2025-05-29 21:18:44.144
{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1","ns":"simulation-production","res_ver":"378182509","routine":"worker-552","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [NotYetStarted] -> [Queued]","ts":"2025-05-30T04:18:44Z"}
2025-05-29 21:18:44.145
{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1","ns":"simulation-production","res_ver":"378182509","routine":"worker-552","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Node event phase: QUEUED, nodeId n1 already exist","ts":"2025-05-30T04:18:44Z"}
And since they share the same S3 path, boom.