# flyte-support
g
Hey, I've got a workflow where each task is relatively light in compute/data needs, but the DAG itself is heavy. I have a workflow inside a dynamic inside a dynamic: the outer level is a dynamic that creates 50 dynamics, and each of those middle-level dynamics creates 35 workflows. The inner workflow itself fits a relatively simple ML model (think XGBoost). When I run this at larger scales, I often see
[1/1] currentAttempt done. Last Error: USER::Pod was rejected: The node had condition: [DiskPressure…
I've tried bumping the disk space on the node pool to something large, but that does not help. Using a lower max-parallelism helps to some extent, but I'd like these to execute in parallel at scale. Is this a known issue with nested dynamics? Is there something I can improve in my Flyte deployment? Is this something that won't be an issue in Flyte 2.0? This post by @clean-glass-36808 about deserializing dynamic workflows massively increasing CPU usage is possibly related: https://flyte-org.slack.com/archives/CP2HDHKE1/p1753231403202179
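For reference, the nesting described above looks roughly like this (a minimal sketch, assuming flytekit; the task and workflow names are hypothetical and the fan-out counts are illustrative):

```python
import typing
from flytekit import dynamic, task, workflow


@task
def fit_model(partition_id: int) -> float:
    # Placeholder for fitting a relatively simple model (e.g. XGBoost)
    # on one partition of the data.
    return 0.0


@workflow
def fit_wf(partition_id: int) -> float:
    # Inner workflow: fits one model.
    return fit_model(partition_id=partition_id)


@dynamic
def middle_dynamic(group_id: int) -> typing.List[float]:
    # Middle level: each dynamic launches ~35 inner workflows.
    return [fit_wf(partition_id=i) for i in range(35)]


@dynamic
def outer_dynamic() -> typing.List[typing.List[float]]:
    # Outer level: one dynamic fans out into ~50 middle-level dynamics,
    # so the compiled DAG ends up with ~1750 subworkflows.
    return [middle_dynamic(group_id=g) for g in range(50)]
```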
c
The issue that I ran into was purely on the Flyte Propeller (state machine) side of things. I don't think it caused disk pressure since I'm pretty sure everything is done in-memory with no buffering to disk.
g
Interesting. Is there any part of dynamic workflows that would cause disk pressure? The tasks themselves read data in and out, but I've given each task plenty of resources (much more storage than the size of the data). But there might be something else on the Flyte side that is writing to disk.
f
It should not cause disk pressure at all
disk pressure is because of not using ephemeral storage and downloading a lot of data across many containers
I think your node or your Kubernetes node configuration may also be wrong, or you have a very small root volume and are using the root volume for containers - could be many things
Also @gorgeous-caravan-46442, I don't know if the disk pressure can be improved, but your entire nested-dynamics setup can be greatly simplified with Flyte 2
g
Hmm, interesting. Given the workflow, it might be that I'm downloading a lot of data across many containers. Can you suggest what part of the cluster I should bump, so I can see if that works?
Great to hear Flyte 2.0 simplifies it
f
You will have to see how your cluster and container disks are configured
I recommend setting an ephemeral-storage specification on your tasks
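On the task side, that would look roughly like this (a minimal sketch, assuming a recent flytekit where Resources accepts ephemeral_storage; the values are placeholders, not recommendations):

```python
from flytekit import Resources, task


@task(
    requests=Resources(cpu="1", mem="2Gi", ephemeral_storage="10Gi"),
    limits=Resources(cpu="2", mem="4Gi", ephemeral_storage="20Gi"),
)
def fit_model(partition_id: int) -> float:
    # With an ephemeral-storage request set, the scheduler only places this
    # pod on nodes with enough free local disk, and the kubelet can evict
    # this pod instead of the whole node hitting DiskPressure.
    return 0.0
```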
g
is there some specific kubectl setting I can check to see what is currently being used?
f
sadly kubectl does not show disk utilization (even in kubectl top)
but sometimes it does - if you configure it
you can try
kubectl top node <node-name>