# flyte-support
c
We've encountered some interesting production issues recently. We ramped up the number of workflows running concurrently in Flyte, and we discovered that when those workflows included dynamic tasks our CPU usage increased dramatically. After some profiling, it looks like 25% or more of propeller's CPU time is spent just deserializing dynamic workflow CRDs from the blob store. This also introduced memory pressure on the system with frequent GC, presumably from the ephemeral objects constantly being created and thrown away.

We found that you can enable an in-memory cache for the dynamic workflow CRDs, but it doesn't really help because the cache still stores raw bytes, so you still incur the cost of JSON deserialization on every workflow evaluation. We're considering building an in-memory LRU cache of the deserialized objects to avoid the repeated deserialization, but I'm curious whether anyone else has seen issues like this with dynamic workflows. I'm also wondering if we could just leverage sub-launch plans instead, since our etcd is not under much load at all.
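For reference, a minimal sketch of the kind of cache we have in mind, using hashicorp/golang-lru. `DynamicWorkflowSpec` and `fetchCRDBytes` are hypothetical stand-ins for the real CRD type and blob-store read; the keys would presumably be the blob-store paths of the dynamic workflow CRDs:

```go
package dynamiccache

import (
	"encoding/json"

	lru "github.com/hashicorp/golang-lru/v2"
)

// DynamicWorkflowSpec is a stand-in for the real deserialized CRD type.
type DynamicWorkflowSpec struct {
	Tasks []json.RawMessage `json:"tasks"`
	Nodes []json.RawMessage `json:"nodes"`
}

// Cache keeps fully deserialized specs keyed by blob-store path, so repeated
// workflow evaluations skip both the fetch and the JSON unmarshal.
type Cache struct {
	lru *lru.Cache[string, *DynamicWorkflowSpec]
	// fetchCRDBytes is a hypothetical loader that reads the raw CRD JSON
	// for a dynamic task from the blob store.
	fetchCRDBytes func(path string) ([]byte, error)
}

func New(size int, fetch func(string) ([]byte, error)) (*Cache, error) {
	l, err := lru.New[string, *DynamicWorkflowSpec](size)
	if err != nil {
		return nil, err
	}
	return &Cache{lru: l, fetchCRDBytes: fetch}, nil
}

// Get returns the cached spec, fetching and deserializing only on a miss.
func (c *Cache) Get(path string) (*DynamicWorkflowSpec, error) {
	if spec, ok := c.lru.Get(path); ok {
		return spec, nil
	}
	raw, err := c.fetchCRDBytes(path)
	if err != nil {
		return nil, err
	}
	spec := &DynamicWorkflowSpec{}
	if err := json.Unmarshal(raw, spec); err != nil {
		return nil, err
	}
	c.lru.Add(path, spec)
	return spec, nil
}
```

One caveat with caching deserialized objects: they'd be shared across evaluations, so callers would have to treat them as read-only (or deep-copy on a hit).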
f
There is a deserialized protobuf cache, right? Or you could use memory pools.
Adding one more LRU cache will make it harder to maintain.
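If the memory-pool route means something along the lines of `sync.Pool`, a rough sketch is below; it cuts allocation/GC churn from the ephemeral buffers, though it doesn't remove the deserialization CPU itself:

```go
package dynamiccache

import (
	"bytes"
	"sync"
)

// bufPool reuses byte buffers for reading CRD blobs so each evaluation
// doesn't allocate (and later GC) a fresh multi-megabyte slice.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// withBuffer hands a recycled buffer to fn and returns it to the pool after.
func withBuffer(fn func(*bytes.Buffer) error) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	return fn(buf)
}
```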
c
This is the dynamic task workflow CRD, so it's JSON. There are in-memory byte array caches that I found, backed by freecache. Didn't really like the implementation, or how you have to tune/hack GC to keep the rest of Go's GC working correctly. Anyway, even with the in-memory byte array cache you still pay the cost of deserialization, which is where the CPU is getting wrecked.
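To illustrate the point: a hit in a byte-array cache like freecache only saves the blob-store round trip; the `json.Unmarshal` still runs on every evaluation. A sketch, reusing the hypothetical `DynamicWorkflowSpec` and fetch function from above:

```go
package dynamiccache

import (
	"encoding/json"

	"github.com/coocood/freecache"
)

// byteCache mirrors the existing in-memory byte-array cache: values are
// still raw JSON, so a hit avoids the blob-store read but not the decode.
var byteCache = freecache.NewCache(512 * 1024 * 1024) // e.g. 512 MiB, allocated up front

func getSpec(path string, fetch func(string) ([]byte, error)) (*DynamicWorkflowSpec, error) {
	raw, err := byteCache.Get([]byte(path))
	if err != nil {
		// Treat any error as a miss: fetch from the blob store and cache the bytes.
		if raw, err = fetch(path); err != nil {
			return nil, err
		}
		_ = byteCache.Set([]byte(path), raw, 0) // no expiry
	}
	// This unmarshal runs on every evaluation, hit or miss: the CPU cost
	// the byte cache cannot remove.
	spec := &DynamicWorkflowSpec{}
	if err := json.Unmarshal(raw, spec); err != nil {
		return nil, err
	}
	return spec, nil
}
```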
f
What is a dynamic workflow CRD?
It's not JSON?
It's a protobuf.
> how you have to tune/hack GC to keep the rest of Go's GC working correctly
This is because the memory cache is very large. Go's GC did not handle very large memory reclaims well, at least until a few years ago.
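For the GC side, the usual knobs when a huge cache skews the pacer are GOGC and, on Go 1.19+, a soft memory limit. A sketch with purely illustrative values:

```go
package gctune

import (
	"os"
	"runtime/debug"
)

// TuneGC adjusts GC pacing for a process that holds a large in-memory cache.
// GOGC controls how much the heap may grow between collections; with a very
// large live heap each cycle has a lot to scan and reclaim, so raising GOGC
// trades memory for fewer cycles, while a soft memory limit caps total growth.
func TuneGC() {
	if os.Getenv("GOGC") == "" {
		debug.SetGCPercent(400) // illustrative value
	}
	if os.Getenv("GOMEMLIMIT") == "" {
		debug.SetMemoryLimit(16 << 30) // e.g. a 16 GiB soft limit
	}
}
```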
c
[Screenshot 2025-07-22 at 10.22.29 PM.png: CPU profile]
You can see `GetWorkflowCRD` is eating up almost half of our 24 CPUs just deserializing workflows for these dynamic tasks.
Another peculiar thing is that we spent an unusual amount of time in TLS handshake server certificate verification. I'm guessing that has something to do with renegotiating TLS too often when pulling all this stuff from the blob store (despite HTTP keep-alive being enabled by default), but that might be an issue with our blob store.
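If the handshakes are from connections not being reused under the higher concurrency, the usual fix is to raise the idle-connection limits on the blob-store client's `http.Transport` (assuming the storage client lets you inject one); values here are illustrative:

```go
package blobclient

import (
	"net/http"
	"time"
)

// NewBlobStoreHTTPClient returns a client tuned for many concurrent requests
// to a single blob-store host. The default Transport keeps only 2 idle
// connections per host, so under high concurrency connections get dropped
// and re-dialed, paying the TLS handshake again.
func NewBlobStoreHTTPClient() *http.Client {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxIdleConns = 256
	t.MaxIdleConnsPerHost = 256 // default is 2
	t.IdleConnTimeout = 90 * time.Second
	t.TLSHandshakeTimeout = 10 * time.Second
	return &http.Client{Transport: t, Timeout: 60 * time.Second}
}
```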
We also need to tune our GC, but yeah, JSON deserialization is what's killing us.