Flyte enables production-grade orchestration for machine learning workflows and data processing, and is built to speed up the move from local workflows to production.

Dear Flyte community,

One difficulty I encounter when working with Flyte is that the pod's local file system is no longer available after an exception has occurred (and the pod has completed). Is there any feature planned to mitigate this issue?
Some proposals from my side:
1. Keep pods with user errors alive (for a configurable duration) to facilitate manual debugging (with kubectl exec, etc.)
2. Create a snapshot of the pod's file system upon a user error (store in k8s or, probably better, in the configured storage backend)
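Until something like proposal 2 exists, a task can approximate it on its own. The following is a minimal sketch in plain Python, not a Flyte feature: the decorator name `snapshot_on_error` and its parameters `work_dir` and `archive_dir` are hypothetical, and uploading the resulting archive to the configured storage backend is left out.

```python
import functools
import os
import tarfile


def snapshot_on_error(work_dir, archive_dir):
    """Archive work_dir into archive_dir if the wrapped function raises.

    Hypothetical helper for illustration; archive_dir should live
    outside work_dir so the archive does not end up inside itself.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                os.makedirs(archive_dir, exist_ok=True)
                archive_path = os.path.join(
                    archive_dir, f"{fn.__name__}-snapshot.tar.gz"
                )
                # Pack the whole working directory for later inspection.
                with tarfile.open(archive_path, "w:gz") as tar:
                    tar.add(
                        work_dir,
                        arcname=os.path.basename(os.path.normpath(work_dir)),
                    )
                # Re-raise so the task is still reported as failed.
                raise
        return wrapper
    return decorator
```

The decorator would wrap the body of a task function, with `archive_dir` pointing at a location that outlives the pod (for example a mounted volume), which is an assumption about the deployment.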

You can keep pods around. Just adjust the GC duration. The current behavior exists because folks complained that we should delete the pods on error.

Thanks for the info and the suggestion to adjust the GC duration. I'll try that out.