Hi everyone! I’m working on a multi-step workflow and have noticed that my Flyte pods wait about 30-45s between the end of one task and the beginning of the next. This wouldn’t be an inconvenience when running a few steps, but it adds up when trying to iterate on workflows of 20+ tasks. The time seems to be spent while the previous pod is still running, before the new one spins up. I’ve confirmed that my code only takes a few seconds, so AFAICT the delay is due to something happening in Flyte itself. Does anyone have insight into what’s going on here and if/how I can speed this up?
04/03/2023, 11:14 PM
Several factors contribute to the delay: k8s pulling the image and scheduling/launching the pod, flytekit downloading task inputs and uploading outputs, and propeller evaluating the nodes. We’re working on a feature that exposes runtime metrics in the system, which will give you more insight into where the time goes. Stay tuned.
For now, there are some ways to speed it up:
1. Increase pod memory.
2. Use a smaller image (less time to download).
3. Use non-fast registration.
4. Use the fsspec plugin to upload/download the data (the latest flytekit uses fsspec by default).
04/04/2023, 4:10 PM
Thanks for the detailed reply.
Can you elaborate on how more pod memory would help and how much is ideal? It’s not image download time in my scenario since this image is already available on the node and I see the pod immediately enter Running without spending time in ContainerCreating. And I am running v1.4.3. Using non-fast register looks promising though! I’ll try that.
04/04/2023, 5:46 PM
Typically, I set memory to 500~800MB. With low memory, it can sometimes take a few seconds to load Python modules.
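One way to check whether module loading is a meaningful part of the gap is to time the imports inside the task container. The following is a minimal sketch using only the Python standard library; the helper names `time_import` and `slowest_imports` and the candidate module list are hypothetical, so swap in the heavy dependencies your tasks actually import.

```python
import importlib
import time


def time_import(module_name: str) -> float:
    """Return the seconds spent importing module_name in this process.

    A module that is already loaded is served from the import cache,
    so it will report a near-zero time.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


def slowest_imports(module_names, top=5):
    """Time each import and return the `top` slowest as (name, seconds) pairs."""
    timings = [(name, time_import(name)) for name in module_names]
    timings.sort(key=lambda item: item[1], reverse=True)
    return timings[:top]


if __name__ == "__main__":
    # Hypothetical candidates: replace with the modules your task imports.
    candidates = ["json", "decimal", "email"]
    for name, seconds in slowest_imports(candidates):
        print(f"{name}: {seconds:.3f}s")
```

Running this in a fresh interpreter under the same memory setting as the task gives a rough per-module picture. CPython's built-in `python -X importtime` flag reports a finer-grained breakdown if you need one.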