# flyte-support
I found an interesting issue in which a branch node's task template seems to get corrupted. Looking for any ideas before I go down a rabbit hole. We have a workflow with 3 conditionals: one conditional runs first and then the other two run in parallel (`sim_result` = n0, `vizlog_path` = n1, `metric_result` = n2).
```python
# explicit DAG chaining is needed since vizlog conversion and metrics computation
# are dependent on sim modality's file generation side effects
sim_result >> vizlog_path
sim_result >> metric_result
```
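For context, here's roughly what the workflow shape looks like. This is a minimal sketch; the task names, input names, and conditions are made up for illustration (the real workflow is `simulation:production:flyte.single_sim`):

```python
from flytekit import conditional, task, workflow


@task
def run_sim(params: str) -> str:
    return f"sim:{params}"


@task
def convert_vizlog(params: str) -> str:
    return f"vizlog:{params}"


@task
def compute_metrics(params: str) -> str:
    return f"metrics:{params}"


@task
def skip(params: str) -> str:
    return params


@workflow
def single_sim(params: str, do_sim: bool, do_vizlog: bool, do_metrics: bool) -> str:
    # n0: the sim conditional runs first
    sim_result = (
        conditional("do_sim")
        .if_(do_sim.is_true()).then(run_sim(params=params))
        .else_().then(skip(params=params))
    )
    # n1 / n2: these only consume workflow inputs, so they have no data
    # dependency on n0 and would otherwise start immediately
    vizlog_path = (
        conditional("do_vizlog")
        .if_(do_vizlog.is_true()).then(convert_vizlog(params=params))
        .else_().then(skip(params=params))
    )
    metric_result = (
        conditional("do_metrics")
        .if_(do_metrics.is_true()).then(compute_metrics(params=params))
        .else_().then(skip(params=params))
    )

    # explicit DAG chaining is needed since vizlog conversion and metrics
    # computation depend on the sim modality's file generation side effects
    sim_result >> vizlog_path
    sim_result >> metric_result
    return metric_result
```

The explicit `>>` chaining matters because n1/n2 have no data dependency on n0, only a side-effect dependency.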
In rare cases we're seeing the inputs of both `vizlog_path` and `metric_result` seemingly break and diverge from the task template. When `vizlog_path` and `metric_result` eventually run, they hit `KeyError`s because their literals and task input types don't match. This also crashes the UI when their nodes are visited. In the most recent case where this happened, it looks like the workflow CRD informer cache went stale and errors were thrown when marking the `sim_result` node as successful. `vizlog_path` and `metric_result` were running concurrently while this was happening.
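To make the failure mode concrete, here's a rough illustration of why a stale or overwritten literal map surfaces as a `KeyError`. This is not the actual flytekit/propeller code path, and the interface names are assumptions based on the decoded `inputs.pb` further down:

```python
# Hypothetical illustration only: the inputs.pb on S3 ends up keyed by the
# branch condition variable, while the task that eventually runs resolves its
# kwargs against its own declared interface.
interface_types = {"params": dict, "task": str}   # assumed convert_vizlog interface
stale_literals = {".do_vizlog": True}             # what the corrupted inputs.pb holds

for name, literal in stale_literals.items():
    expected_type = interface_types[name]         # KeyError: '.do_vizlog'
```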
```
2025-05-29 21:18:43.179	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n0","ns":"simulation-production","res_ver":"378182484","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [Running] -> [Succeeded], (handler phase [Success])","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.643	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1/n1-n0","ns":"simulation-production","res_ver":"378182515","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Dynamic handler.Handle's called with phase 0.","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.696	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n2/n2-n0","ns":"simulation-production","res_ver":"378182515","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Dynamic handler.Handle's called with phase 0.","ts":"2025-05-30T04:18:43Z"}
...
2025-05-29 21:18:43.823	{"json":{"exec_id":"fqyaqtoaaz4s1a","ns":"simulation-production","routine":"worker-9073"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"fqyaqtoaaz4s1a\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.874	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n0","ns":"simulation-production","res_ver":"378182484","routine":"worker-902","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [Running] -> [Succeeded], (handler phase [Success])","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.874	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n0","ns":"simulation-production","res_ver":"378182484","routine":"worker-902","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Recording NodeEvent [node_id:\"n0\" execution_id:{project:\"simulation\" domain:\"production\" name:\"fqyaqtoaaz4s1a\"}] phase[SUCCEEDED]","ts":"2025-05-30T04:18:43Z"}
2025-05-29 21:18:43.874	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n0","ns":"simulation-production","res_ver":"378182484","routine":"worker-902","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Failed to record node event [id:{node_id:\"n0\" ... with err: AlreadyExists: Event Already Exists, caused by [event has already been sent]","ts":"2025-05-30T04:18:43Z"}
...
2025-05-29 21:21:02.976	{"json":{"exec_id":"fqyaqtoaaz4s1a","ns": ... KeyError: '.do_vizlog'\\n\\nMessage:\\n\\n    KeyError: '.do_vizlog'\" kind:SYSTEM timestamp:{seconds:1748578849 nanos:175162315}]","ts":"2025-05-30T04:21:02Z"}
```
Related to https://github.com/flyteorg/flyte/issues/6441
Broken Case
```
❯ protoc --decode_raw < inputs.pb 
1 {
  1: ".do_vizlog"
  2 {
    1 {
      1 {
        4: 1
      }
    }
  }
}
```
Working Case
```
❯ protoc --decode_raw < inputs.pb 
1 {
  1: "params"
  2 {
    1 {
      7 {
        1 .....
        }
      }
    }
  }
}
1 {
  1: "task"
  2 {
    1 {
      1 {
        3: "convert_vizlog"
      }
    }
  }
}
```
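For what it's worth, the same files can also be decoded against the real schema instead of `--decode_raw`, which labels the keys and literal types. A small sketch, assuming `flyteidl` is pip-installed and `inputs.pb` is in the current directory:

```python
from flyteidl.core import literals_pb2

lm = literals_pb2.LiteralMap()
with open("inputs.pb", "rb") as f:
    lm.ParseFromString(f.read())

# Broken case: a single boolean keyed ".do_vizlog";
# working case: "params" and "task", as convert_vizlog expects.
for name, literal in lm.literals.items():
    print(name, "->", literal)
```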
As mentioned, this happens rarely, but at our scale it seems to happen every day. When these workflows are relaunched while the system is under less load, they complete fine.
Looks like the branch nodes went back to a queued state, likely because of the stale informer cache.
```
2025-05-29 21:18:44.144	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1","ns":"simulation-production","res_ver":"378182509","routine":"worker-552","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [NotYetStarted] -> [Queued]","ts":"2025-05-30T04:18:44Z"}
```
I think the issue here is that the branch root node as well as the conditional node use the same S3 path for inputs. When the branch node gets reprocessed, it's probably re-running the input data setup and overwriting the data in S3.
Root node input written the first time
```
2025-05-29 21:18:43.291	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1","ns":"simulation-production","res_ver":"378182509","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [NotYetStarted] -> [Queued]","ts":"2025-05-30T04:18:43Z"}
```
Conditional node input written the first time
```
2025-05-29 21:18:43.541	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1/n1-n0","ns":"simulation-production","res_ver":"378182512","routine":"worker-9073","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [NotYetStarted] -> [Queued]","ts":"2025-05-30T04:18:43Z"}
```
Root node input written the second time
```
2025-05-29 21:18:44.144	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1","ns":"simulation-production","res_ver":"378182509","routine":"worker-552","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Change in node state detected from [NotYetStarted] -> [Queued]","ts":"2025-05-30T04:18:44Z"}
2025-05-29 21:18:44.145	{"json":{"exec_id":"fqyaqtoaaz4s1a","node":"n1","ns":"simulation-production","res_ver":"378182509","routine":"worker-552","wf":"simulation:production:flyte.single_sim"},"level":"info","msg":"Node event phase: QUEUED, nodeId n1 already exist","ts":"2025-05-30T04:18:44Z"}
```
And since they share the same S3 path, boom.
I think if the complete node ID were encoded into the input path, this would be robust to this issue.
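Roughly what I have in mind, with a hypothetical `inputs_key` helper (this is not propeller's actual storage layout, just a sketch of the idea):

```python
def inputs_key(prefix: str, node_id_chain: list[str]) -> str:
    """Derive the inputs.pb key from the fully qualified node ID chain."""
    return f"{prefix}/{'/'.join(node_id_chain)}/inputs.pb"


prefix = "s3://bucket/metadata/fqyaqtoaaz4s1a"           # made-up prefix
branch_root = inputs_key(prefix, ["n1"])                 # branch root node
conditional_node = inputs_key(prefix, ["n1", "n1-n0"])   # node under the branch

# With the full node ID in the key, the two writers can't clobber each other.
assert branch_root != conditional_node
```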
We're still on v1.14 🙂