# ask-ai
v
@ancient-wolf-19325, I am seeing these logs in the flyte propeller service. In a Flyte workflow, one of the map_task pods failed, but the workflow is still in the running state:

    2025-03-18T09:18:04.759+05:30 {"json":{"exec_id":"job-1dd5adf0-a168-416e-a807-d6092ade4513","ns":"kaas-v2-9bc1bcad-a282-4682-977d-cdb7021b9984-development","routine":"worker-16"},"level":"error","msg":"Failed storing workflow to the store, reason: workflow too large","ts":"2025-03-18T03:48:04Z"}
    2025-03-18T09:18:04.797+05:30 {"json":{"exec_id":"job-1dd5adf0-a168-416e-a807-d6092ade4513","ns":"kaas-v2-9bc1bcad-a282-4682-977d-cdb7021b9984-development","routine":"worker-16"},"level":"error","msg":"Failed recording a large workflow as failed, reason: workflow too large. Retrying...","ts":"2025-03-18T03:48:04Z"}
    2025-03-18T09:18:04.797+05:30 E0318 03:48:04.797031 1 workers.go:103] error syncing 'kaas-v2-9bc1bcad-a282-4682-977d-cdb7021b9984-development/job-1dd5adf0-a168-416e-a807-d6092ade4513': workflow too large
j
The error message "workflow too large" indicates that the Flyte workflow's execution state has exceeded the size limit that FlytePropeller can store. This is a known issue with large workflows, especially those with extensive fan-out tasks (such as map tasks) or large data outputs. FlytePropeller stores the workflow state as a custom resource in etcd, and if the state is too large, the update fails, causing the workflow to remain in a running state.

According to Flyte's documentation, when FlytePropeller encounters a workflow that is too large, it should ideally mark the workflow as failed after system retries. If the workflow nevertheless remains in a running state, the failure record itself is likely too large to persist (note the "Failed recording a large workflow as failed" line in your logs), because the state exceeds the limits of the underlying storage system; etcd rejects write requests larger than roughly 1.5 MiB by default.

Would you like more detailed guidance on how to address this issue or further investigate the workflow's state?

Sources:
- Flyte's documentation
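To confirm the diagnosis, you can measure how large the stuck FlyteWorkflow custom resource actually is. The following is a minimal sketch, assuming the kubernetes Python client is installed, a kubeconfig with read access to the namespace, and the flyte.lyft.com/v1alpha1 CRD group/version that FlytePropeller registers; note it measures the API server's JSON view of the object, which only approximates what etcd stores:

    import json

    from kubernetes import client, config


    def flyteworkflow_size_bytes(namespace: str, name: str) -> int:
        """Fetch the FlyteWorkflow CR and return its JSON-serialized size."""
        config.load_kube_config()
        api = client.CustomObjectsApi()
        cr = api.get_namespaced_custom_object(
            group="flyte.lyft.com",  # CRD group registered by FlytePropeller
            version="v1alpha1",
            namespace=namespace,
            plural="flyteworkflows",
            name=name,
        )
        return len(json.dumps(cr).encode("utf-8"))


    if __name__ == "__main__":
        # Namespace and exec_id taken from the logs above.
        size = flyteworkflow_size_bytes(
            "kaas-v2-9bc1bcad-a282-4682-977d-cdb7021b9984-development",
            "job-1dd5adf0-a168-416e-a807-d6092ade4513",
        )
        print(f"FlyteWorkflow CR size: {size} bytes ({size / (1 << 20):.2f} MiB)")

A result near or above 1.5 MiB is consistent with etcd rejecting the update.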
v
@ancient-wolf-19325 yes
j
To address the issue of a Flyte workflow being stuck in a running state due to its size exceeding the storage limits, you can follow these steps:

1. Enable Offloading: Flyte provides a feature to offload large workflow states to external storage, which helps manage large workflows. You can enable it by setting the useOffloadedWorkflowClosure flag in your Flyte configuration. This offloads the workflow's static information from the Custom Resource Definition (CRD) to a blob store, reducing the size of the data stored in etcd.
    configmap:
      adminServer:
        flyteadmin:
          useOffloadedWorkflowClosure: true
This configuration applies only to new workflows; it does not affect executions that are already running. You can find more details in the Flyte documentation.

2. Optimize Workflow Design: Consider restructuring your workflow to reduce its size, for example by breaking a large workflow into smaller sub-workflows or sub-launch plans. Each sub-workflow or launch plan is treated as a separate entity with its own state, which keeps the size of any single workflow's state manageable (see the first sketch after this list).

3. Monitor Workflow Size: Regularly monitor the size of your workflows and the data they produce so you can identify potential issues before they cause failures. You can use Flyte's monitoring tools and metrics to keep track of workflow sizes and execution states.

4. Handle Large Outputs: If your tasks produce large outputs, consider storing these outputs in an external storage system and keeping only references to them in the workflow state. This can significantly reduce the size of the data stored in etcd (see the second sketch after this list).

5. Increase etcd Limits: As a last resort, you can increase etcd's request size limit (its --max-request-bytes flag, which defaults to roughly 1.5 MiB), but this is generally not recommended due to potential performance impacts. It is better to optimize the workflow design and use the offloading feature.

By implementing these strategies, you can manage large workflows more effectively and prevent them from getting stuck in a running state due to size limitations.

Sources:
- Flyte issue on large workflows
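For step 2, here is a minimal flytekit sketch of the sub-launch-plan approach, assuming flytekit is installed; the task, workflow, and launch plan names are hypothetical placeholders:

    from flytekit import LaunchPlan, dynamic, map_task, task, workflow


    @task
    def process_item(x: int) -> int:
        # Placeholder for the real per-item work.
        return x * 2


    @workflow
    def process_chunk(items: list[int]) -> list[int]:
        # Child workflow: fans out over one bounded chunk only.
        return map_task(process_item)(x=items)


    # Invoking a launch plan from a parent creates a separate child execution,
    # so each chunk gets its own (small) FlyteWorkflow CR in etcd.
    chunk_lp = LaunchPlan.get_or_create(workflow=process_chunk, name="process_chunk_lp")


    @dynamic
    def process_all(items: list[int], chunk_size: int) -> None:
        # A dynamic task runs as Python at execution time, so len() and
        # slicing operate on the materialized input list.
        for start in range(0, len(items), chunk_size):
            chunk_lp(items=items[start : start + chunk_size])

Because each launch plan invocation becomes its own execution, the per-workflow state stays bounded by chunk_size rather than by the full input.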
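And for step 4, a minimal sketch of passing a large artifact by reference with flytekit's FlyteFile type, so that only a blob-store URI travels through the workflow state; the file contents here are a stand-in:

    import os
    import tempfile

    from flytekit import task, workflow
    from flytekit.types.file import FlyteFile


    @task
    def produce_report() -> FlyteFile:
        # Write the large output to a local file; flytekit uploads it to the
        # configured blob store and records only its URI in the workflow state.
        path = os.path.join(tempfile.mkdtemp(), "report.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(10 * 1024 * 1024))  # stand-in for real output
        return FlyteFile(path=path)


    @task
    def consume_report(report: FlyteFile) -> int:
        # Downloads lazily; only the reference passed through the workflow CR.
        with open(report.download(), "rb") as f:
            return len(f.read())


    @workflow
    def report_pipeline() -> int:
        return consume_report(report=produce_report())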
a
@victorious-jelly-28729 did the above work for you?
v
I haven't used this; I found the actual problem in my code.