<@U06RKAMK0UC>, in flyte 1.14 do we have any conf...
# ask-ai
v
@ancient-wolf-19325, in Flyte 1.14, do we have any configuration so that if a task pod's image is not pulled within 1 minute, the pod is killed automatically?
a
It seems likely that Flyte 1.14 supports a configuration to automatically fail pods that remain in the Pending state (such as when an image has not been pulled) for longer than a specified timeout. This is controlled by the pod-pending-timeout setting in the Kubernetes plugin configuration. If a pod is stuck in Pending (e.g., due to image pull delays) beyond the configured timeout (e.g., 1 minute), FlytePropeller will mark the task as failed and clean up the pod. By default, this timeout is 0 (disabled), but you can set it to 1m to achieve the behavior you describe. You would configure this by setting plugins -> k8s -> pod-pending-timeout: 1m in your Flyte configuration (e.g., flyte_single_binary.yaml). This feature was introduced to address exactly the scenario you described and is documented in recent Flyte releases and code changes. Would you like more detail or a sample configuration snippet?

Sources:
- Add PodPendingTimeout config to fail pods stuck in pending
- Feature/add pod pending timeout config
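For illustration, a minimal sketch of that setting is below. The placement follows the plugins -> k8s path named above; in a flyte-core deployment like yours it would typically live in the k8s.yaml section of the propeller ConfigMap, but verify against your chart before applying:

```yaml
# Sketch only: fail pods that stay Pending (e.g. waiting on an image pull)
# for longer than 1 minute. The default is 0, i.e. the timeout is disabled.
plugins:
  k8s:
    pod-pending-timeout: 1m
```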
v
@ancient-wolf-19325, no, I mean we are seeing an issue where a task pod takes time to pull its image, and once it reaches the Running state it immediately gets killed.
a
To help diagnose why your pod is being killed right after it starts running, could you please share the pod's status and the reason for termination for a pod that was killed immediately after entering the Running state? You can get this information by running kubectl describe pod <pod-name> and sharing the relevant output.
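For reference, the termination details also appear under status.containerStatuses in the pod object (e.g., via kubectl get pod <pod-name> -o yaml). The excerpt below is a hypothetical illustration of where to look; the container name, reason, and exit code are placeholders, not values from your cluster:

```yaml
# Hypothetical excerpt of a pod's status; values are illustrative only.
status:
  containerStatuses:
    - name: primary              # assumed container name; yours may differ
      restartCount: 0
      state:
        terminated:
          exitCode: 137          # 137 usually means the container received SIGKILL
          reason: OOMKilled      # could also be Error, Completed, etc.
          startedAt: "2025-07-22T21:15:00Z"
          finishedAt: "2025-07-22T21:15:05Z"
```

The reason and exitCode here are what would explain a kill right after the image finally pulls and the container starts.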
v
@ancient-wolf-19325,

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyte-propeller-config
  namespace: duploservices-xforge-qa
  uid: c6a91d18-39ae-4fca-9e92-3b45e01e5745
  resourceVersion: '148221669'
  creationTimestamp: '2025-07-22T21:13:04Z'
  labels:
    app.kubernetes.io/instance: flyte
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flyteadmin
    helm.sh/chart: flyte-core-v1.14.1
    k8slens-edit-resource-version: v1
  annotations:
    meta.helm.sh/release-name: flyte
    meta.helm.sh/release-namespace: duploservices-xforge-qa
  selfLink: /api/v1/namespaces/duploservices-xforge-qa/configmaps/flyte-propeller-config
spec: {}
data:
  admin.yaml: |
    admin:
      clientId: 'f918e381-c854-4566-9b6c-1d99d1d7874d'
      clientSecretLocation: /etc/secrets/client_secret
      endpoint: flyteadmin:81
      insecure: true
      scopes:
        - api://f918e381-c854-4566-9b6c-1d99d1d7874d/.default
      useAudienceFromAdmin: true
    event:
      capacity: 1000
      rate: 500
      type: admin
  catalog.yaml: |
    catalog-cache:
      endpoint: datacatalog:89
      insecure: true
      type: datacatalog
  copilot.yaml: |
    plugins:
      k8s:
        co-pilot:
          image: cr.flyte.org/flyteorg/flytecopilot-release:v1.12.0
          name: flyte-copilot-
          start-timeout: 300s
        default-pod-template-name: flyte-imagepullsecrets-template
  core.yaml: |
    manager:
      pod-application: flytepropeller
      pod-template-container-name: flytepropeller
      pod-template-name: flytepropeller-template
    propeller:
      downstream-eval-duration: 30s
      enable-admin-launcher: true
      gc-interval: 12h
      kube-client-config:
        burst: 25
        qps: 100
        timeout: 30s
      leader-election:
        enabled: true
        lease-duration: 15s
        lock-config-map:
          name: propeller-leader
          namespace: duploservices-xforge-qa
        renew-deadline: 10s
        retry-period: 2s
      limit-namespace: all
      literal-offloading-config:
        enabled: true
      max-workflow-retries: 50
      metadata-prefix: metadata/propeller
      metrics-prefix: flyte
      prof-port: 10254
      queue:
        batch-size: -1
        batching-interval: 2s
        queue:
          base-delay: 5s
          capacity: 1000
          max-delay: 120s
          rate: 100
          type: maxof
        sub-queue:
          capacity: 1000
          rate: 100
          type: bucket
        type: batch
      rawoutput-prefix: s3://duploservices-xforge-qa-flyte-089910700032/
      workers: 40
      workflow-reeval-duration: 30s
    webhook:
      certDir: /etc/webhook/certs
      serviceName: flyte-pod-webhook
  enabled_plugins.yaml: |
    tasks:
      task-plugins:
        default-for-task-types:
          container: container
          container_array: k8s-array
          sidecar: sidecar
        enabled-plugins:
          - container
          - sidecar
          - k8s-array
          - agent-service
  k8s.yaml: |
    plugins:
      k8s:
        default-cpus: 100m
        default-env-vars: []
        default-memory: 100Mi
  resource_manager.yaml: |
    propeller:
      resourcemanager:
        type: noop
  storage.yaml: |
    storage:
      type: s3
      container: "duploservices-xforge-qa-flyte-089910700032"
      connection:
        auth-type: iam
        region: us-west-2
      enable-multicontainer: false
      limits:
        maxDownloadMBs: 1000
      cache:
        max_size_mbs: 0
        target_gc_percent: 70
  task_logs.yaml: |
    plugins:
      logs:
        cloudwatch-enabled: false
        kubernetes-enabled: true
        kubernetes-template-uri: http://localhost:30082/#/log///pod?namespace=
```

What is limit-namespace: all here?
a
In your FlytePropeller config, limit-namespace: all means that FlytePropeller is configured to watch and manage workflows across all Kubernetes namespaces, not just a single one. This allows Flyte to orchestrate tasks and workflows in any namespace within the cluster rather than being restricted to a specific one. This setting is standard for multi-tenant or multi-project Flyte deployments, but you can restrict FlytePropeller to a single namespace by setting limit-namespace to a specific namespace name instead of all (source, source). Would you like more detail on namespace scoping or multi-namespace support?

Sources:
- propeller-config.yaml
- https://github.com/flyteorg/flyte/blob/master/charts/flyte-core/README.md
v
@ancient-wolf-19325, can I set it to a single namespace like "duploservices-xforge-qa", where Flyte is installed?
a
Yes, you can set limit-namespace: duploservices-xforge-qa to restrict FlytePropeller to only manage workflows in the duploservices-xforge-qa namespace. This limits Flyte's orchestration to that namespace only, rather than all namespaces (source, source). Would you like more detail on how to update this setting or its effects?

Sources:
- https://github.com/flyteorg/flyte/blob/master/charts/flyte-core/README.md
- https://github.com/flyteorg/flyte/blob/master/flytepropeller/propeller-config.yaml
v
@ancient-wolf-19325, Workflow execution not found in flyteadmin
a
The error "Workflow execution not found in flyteadmin" means that FlytePropeller or another component is trying to process a workflow execution that does not exist in the FlyteAdmin database. This can happen if the FlyteWorkflow CRD exists in Kubernetes, but the corresponding execution record is missing in FlyteAdmin, often due to a race condition, database reset, or manual deletion (source, source). To resolve this, ensure that your database and CRDs are in sync. If you have recently reset the database or cluster, you may need to manually delete orphaned FlyteWorkflow CRDs from Kubernetes. Would you like more detail on troubleshooting or cleanup steps? Sources: - https://github.com/flyteorg/flyte/issues/1048 - https://github.com/flyteorg/flyte/issues/2275 šŸ‘ 0 button šŸ‘Ž 0 button šŸ’” Hint: Mention @runllm in the thread for followups.
v
@ancient-wolf-19325, we installed Flyte in multiple namespaces within a single cluster. Sometimes when workflows start, n0 works, n1 works, and n2 fails with "Workflow execution not found in flyteadmin". How do we ensure that the database and CRDs are in sync?
a
This error occurs when a FlyteWorkflow CRD exists in Kubernetes but the corresponding execution record is missing in FlyteAdmin, often due to race conditions or misconfiguration in multi-namespace setups. To keep the database and CRDs in sync, avoid running multiple Flyte deployments (with separate FlyteAdmin instances) that share the same database or have overlapping namespace management, as this can cause conflicts and orphaned CRDs (source, source). Best practice is to run a single FlyteAdmin instance per database and ensure that only one FlytePropeller manages a given namespace. If you encounter orphaned CRDs, manually delete them from Kubernetes. Also, avoid resetting the database without cleaning up CRDs, and review your deployment to ensure each Flyte installation is isolated by namespace and database. Would you like a step-by-step guide for cleanup or deployment best practices?

Sources:
- https://github.com/flyteorg/flyte/issues/1048
- https://github.com/flyteorg/flyte/issues/2275
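As a rough illustration of that per-namespace isolation (a sketch, not taken from your cluster; the second namespace duploservices-xforge-dev is hypothetical), each FlytePropeller would carry its own limit-namespace and its own leader-election lock ConfigMap, so no two propellers compete for the same workflows or the same lock:

```yaml
# Sketch: core.yaml for the propeller deployed in duploservices-xforge-qa
propeller:
  limit-namespace: duploservices-xforge-qa
  leader-election:
    enabled: true
    lock-config-map:
      name: propeller-leader
      namespace: duploservices-xforge-qa
---
# Sketch: core.yaml for a second, hypothetical propeller in duploservices-xforge-dev
propeller:
  limit-namespace: duploservices-xforge-dev
  leader-election:
    enabled: true
    lock-config-map:
      name: propeller-leader
      namespace: duploservices-xforge-dev
```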
v
@ancient-wolf-19325, we have a single FlyteAdmin instance per database/namespace, and one FlytePropeller manages a given namespace via limit-namespace: {{namespace name}}.
a
If you have a single FlyteAdmin instance per database/namespace and one FlytePropeller per namespace (using limit-namespace: <namespace>), but still see "Workflow execution not found in flyteadmin", it may be due to a race condition where the FlyteWorkflow CRD is created before the execution is fully registered in FlyteAdmin. This is a known issue in Flyte, especially under high load or with multiple deployments (source, source). To mitigate it, ensure that FlyteAdmin and FlytePropeller are not restarted or upgraded independently during workflow creation, and consider enabling exponential backoff in FlytePropeller to reduce the chance of hitting this race. If the issue persists, check for orphaned CRDs and clean them up manually. Would you like more detail on the race condition or configuration tips to reduce its impact?

Sources:
- https://github.com/flyteorg/flyte/issues/1048
- https://github.com/flyteorg/flyte/issues/2275
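On the backoff point, here is a sketch of the knobs involved, under the assumption that the exponential backoff mentioned above maps to FlytePropeller's workqueue settings already present in your core.yaml. Raising base-delay simply spaces out retries of a workflow whose execution record is not yet visible in FlyteAdmin; this is an assumption on my part, not an officially documented fix:

```yaml
# Sketch only (assumption): slow down re-queue retries so FlyteAdmin has more
# time to register the execution before FlytePropeller retries the workflow.
propeller:
  queue:
    type: batch
    queue:
      type: maxof
      base-delay: 10s    # your pasted config uses 5s; a larger value backs off longer between retries
      max-delay: 120s
      capacity: 1000
      rate: 100
    sub-queue:
      type: bucket
      capacity: 1000
      rate: 100
```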