# flyte-support
c
We've encountered some interesting production issues recently. We ramped up the number of workflows running concurrently in Flyte, and we discovered that when those workflows included dynamic tasks our CPU usage increased dramatically. After some profiling, it looks like 25% or more of propeller's CPU time is spent just deserializing dynamic workflow CRDs from the blob store. This also introduced memory pressure on the system with frequent GC, presumably from the ephemeral objects constantly being created and thrown away.

We found that you can enable an in-memory cache for the dynamic workflow CRDs, but it doesn't really help because the cache still stores raw bytes, so you still incur the cost of JSON deserialization on every workflow evaluation. We're considering building an in-memory LRU cache of the deserialized objects to avoid the repeated deserialization, but I'm curious whether anyone else has seen issues like this with dynamic workflows. I'm also wondering if we could just leverage sub-launch plans instead, since our etcd is not under much load at all.
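For reference, a minimal sketch of the kind of cache we have in mind, using hashicorp/golang-lru. `DynamicWorkflowSpec` and `fetchCRDBytes` are hypothetical stand-ins for the real CRD type and blob-store read; the keys would presumably be the blob-store paths of the dynamic workflow CRDs:

```go
package dynamiccache

import (
	"encoding/json"

	lru "github.com/hashicorp/golang-lru/v2"
)

// DynamicWorkflowSpec is a stand-in for the real deserialized CRD type.
type DynamicWorkflowSpec struct {
	Tasks []json.RawMessage `json:"tasks"`
	Nodes []json.RawMessage `json:"nodes"`
}

// Cache keeps fully deserialized specs keyed by blob-store path, so repeated
// workflow evaluations skip both the fetch and the JSON unmarshal.
type Cache struct {
	lru *lru.Cache[string, *DynamicWorkflowSpec]
	// fetchCRDBytes is a hypothetical loader that reads the raw CRD JSON
	// for a dynamic task from the blob store.
	fetchCRDBytes func(path string) ([]byte, error)
}

func New(size int, fetch func(string) ([]byte, error)) (*Cache, error) {
	l, err := lru.New[string, *DynamicWorkflowSpec](size)
	if err != nil {
		return nil, err
	}
	return &Cache{lru: l, fetchCRDBytes: fetch}, nil
}

// Get returns the cached spec, fetching and deserializing only on a miss.
func (c *Cache) Get(path string) (*DynamicWorkflowSpec, error) {
	if spec, ok := c.lru.Get(path); ok {
		return spec, nil
	}
	raw, err := c.fetchCRDBytes(path)
	if err != nil {
		return nil, err
	}
	spec := &DynamicWorkflowSpec{}
	if err := json.Unmarshal(raw, spec); err != nil {
		return nil, err
	}
	c.lru.Add(path, spec)
	return spec, nil
}
```

One caveat with caching deserialized objects: they'd be shared across evaluations, so callers would have to treat them as read-only (or deep-copy on a hit).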
f
There is a deserialized protobuf cache, right? Or you could use memory pools.
Adding one more LRU cache will make it harder to maintain.
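If the memory-pool route means something along the lines of `sync.Pool`, a rough sketch is below; it cuts allocation/GC churn from the ephemeral buffers, though it doesn't remove the deserialization CPU itself:

```go
package dynamiccache

import (
	"bytes"
	"sync"
)

// bufPool reuses byte buffers for reading CRD blobs so each evaluation
// doesn't allocate (and later GC) a fresh multi-megabyte slice.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// withBuffer hands a recycled buffer to fn and returns it to the pool after.
func withBuffer(fn func(*bytes.Buffer) error) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	return fn(buf)
}
```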
c
This is the dynamic task workflow CRD, so it's JSON. There are in-memory byte array caches that I found, backed by freecache. Didn't really like the implementation, or how you have to tune/hack GC to keep the rest of Go's GC working correctly. Anyway, even with the in-memory byte array cache you still pay the cost of deserialization, which is where the CPU is getting wrecked.
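To illustrate the point: a hit in a byte-array cache like freecache only saves the blob-store round trip; the `json.Unmarshal` still runs on every evaluation. A sketch, reusing the hypothetical `DynamicWorkflowSpec` and fetch function from above:

```go
package dynamiccache

import (
	"encoding/json"

	"github.com/coocood/freecache"
)

// byteCache mirrors the existing in-memory byte-array cache: values are
// still raw JSON, so a hit avoids the blob-store read but not the decode.
var byteCache = freecache.NewCache(512 * 1024 * 1024) // e.g. 512 MiB, allocated up front

func getSpec(path string, fetch func(string) ([]byte, error)) (*DynamicWorkflowSpec, error) {
	raw, err := byteCache.Get([]byte(path))
	if err != nil {
		// Treat any error as a miss: fetch from the blob store and cache the bytes.
		if raw, err = fetch(path); err != nil {
			return nil, err
		}
		_ = byteCache.Set([]byte(path), raw, 0) // no expiry
	}
	// This unmarshal runs on every evaluation, hit or miss: the CPU cost
	// the byte cache cannot remove.
	spec := &DynamicWorkflowSpec{}
	if err := json.Unmarshal(raw, spec); err != nil {
		return nil, err
	}
	return spec, nil
}
```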
f
What is a dynamic workflow CRD?
It's not JSON?
It's a protobuf.
> how you have to tune/hack GC to keep the rest of Go's GC working correctly
This is because the memory cache is very large. Go's GC did not handle very large memory reclaims well, at least until a few years ago.
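For the GC side, the usual knobs when a huge cache skews the pacer are GOGC and, on Go 1.19+, a soft memory limit. A sketch with purely illustrative values:

```go
package gctune

import (
	"os"
	"runtime/debug"
)

// TuneGC adjusts GC pacing for a process that holds a large in-memory cache.
// GOGC controls how much the heap may grow between collections; with a very
// large live heap each cycle has a lot to scan and reclaim, so raising GOGC
// trades memory for fewer cycles, while a soft memory limit caps total growth.
func TuneGC() {
	if os.Getenv("GOGC") == "" {
		debug.SetGCPercent(400) // illustrative value
	}
	if os.Getenv("GOMEMLIMIT") == "" {
		debug.SetMemoryLimit(16 << 30) // e.g. a 16 GiB soft limit
	}
}
```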
c
[Screenshot 2025-07-22 at 10.22.29 PM.png: CPU profile]
You can see `GetWorkflowCRD` is eating up almost half of our 24 CPUs just deserializing workflows for these dynamic tasks.
Another peculiar thing is that we spent an unusual amount of time in TLS handshake server certificate verification. I'm guessing that has something to do with renegotiating TLS too often when pulling all this stuff from the blob store (despite HTTP keep-alive being enabled by default), but that might be an issue with our blob store.
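If the handshakes are from connections not being reused under the higher concurrency, the usual fix is to raise the idle-connection limits on the blob-store client's `http.Transport` (assuming the storage client lets you inject one); values here are illustrative:

```go
package blobclient

import (
	"net/http"
	"time"
)

// NewBlobStoreHTTPClient returns a client tuned for many concurrent requests
// to a single blob-store host. The default Transport keeps only 2 idle
// connections per host, so under high concurrency connections get dropped
// and re-dialed, paying the TLS handshake again.
func NewBlobStoreHTTPClient() *http.Client {
	t := http.DefaultTransport.(*http.Transport).Clone()
	t.MaxIdleConns = 256
	t.MaxIdleConnsPerHost = 256 // default is 2
	t.IdleConnTimeout = 90 * time.Second
	t.TLSHandshakeTimeout = 10 * time.Second
	return &http.Client{Transport: t, Timeout: 60 * time.Second}
}
```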
We also need to tune our GC, but yeah, JSON deserialization is what's killing us.