Hi all We are using flyte binary on EKS When we run largish Flyte #flyte-support

tall-ram-83532

06/10/2024, 12:29 PM

Hi all, We are using flyte-binary on EKS. When we run largish (~600 task) parallelized workflows, many tasks fail with what seems like a k8s versioning error: "Operation cannot be fulfilled on pods <pod_name>: the object has been modified; please apply your changes to the latest version and try again". The workflow is a dynamic workflow which spins off the large number of tasks. Looks like some kind of race condition when modifying the pod state, any idea why?

freezing-airport-6809

06/10/2024, 1:28 PM

600 is not large. We run fan-outs of 5k-ish.

freezing-airport-6809

06/10/2024, 1:29 PM

There are 2 things: 1. Did you change your resources? 2. Search for inject-finalizer in the docs.

tall-ram-83532

06/10/2024, 2:28 PM

Hi @freezing-airport-6809, thank you for replying. What do you mean by "did you change your resources"? There isn't any manual change in the resources or EKS. Regarding inject-finalizer, I see that we did in fact have this set to "true". If we turn it off, what are the implications? The documentation is somewhat sparse, or perhaps I'm not looking in the right place.

freezing-airport-6809

06/10/2024, 2:30 PM

I mean cpu / mem for the single binary

tall-ram-83532

06/10/2024, 2:31 PM

Yes, I did:

Limits:
  cpu:     4
  memory:  28Gi
Requests:
  cpu:     2
  memory:  28Gi

tall-ram-83532

06/10/2024, 2:35 PM

Still getting "503 service unavailable" with these resources too I'm afraid 😕 Do you think these two issues are related?

freezing-airport-6809

06/10/2024, 9:24 PM

hmm what, that is not right

freezing-airport-6809

06/10/2024, 9:24 PM

503 seems to be coming from your proxy or router

freezing-airport-6809

06/10/2024, 9:25 PM

503 is a specific error message, that is returned for quota exceeded / request rate exceeded

freezing-airport-6809

06/10/2024, 9:26 PM

have you seen the cpu / mem usage for the pod? It might also be number of open threads etc

tall-ram-83532

06/13/2024, 7:30 AM

The cpu/mem looks to be OK, it's not reaching the resource limit. There's a gap in the metrics collection that's correlated with the 503 errors, which would hint that the metric scraping is also not getting through to the pod. The 503 error message in the logs has to do with the oauth token, which is strange to me - isn't flyte authenticating against its own internal server in the same pod? This is the error message:

E0609 08:22:20.042972 7 workers.go:103] error syncing 'flytesnacks-development/azfmfpqb26mzkk8j4xlk': Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Unauthenticated desc = transport: per-RPC creds failed due to error: failed to get token: oauth2: cannot fetch token: 503 Service Unavailable

freezing-airport-6809

06/13/2024, 2:08 PM

Ohh interesting. Thank you for sharing this. Do you use some oauth2 provider? I will talk to someone smarter about this

tall-ram-83532

06/13/2024, 2:43 PM

Yes, we use JumpCloud to authenticate to the console, but we haven't configured it to replace the server-to-server communication, we use the built-in flyte authentication for that.

average-finland-92144

06/13/2024, 9:02 PM

"Regarding inject-finalizer, I see that we did in fact have this set to "true", if we turn it off, what are the implications?"

I think it's fine to leave it set to True, especially if you need to handle high load in your cluster. It basically enables tighter control over the Pod's lifecycle, preventing Pods from being deleted until Flyte explicitly removes the finalizer. The error message you're getting is not the typical one when those situations happen, though. Here it is the worker that seems to find a stale ResourceVersion and is then prevented from writing changes to the resource. There are ways to handle this, some of them covered in this page.

As for the error getting the token, I'm not sure if it refers to the ID Token that OIDC returns, which comes from your IdP in this case, but it's a sign of high load somewhere.
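
For anyone landing here later, the stale ResourceVersion situation described above can be illustrated with a small sketch. This is a toy in-memory model, not Flyte's code or the real Kubernetes client: FakeApiServer, ConflictError and update_with_retry are hypothetical names made up for the example. It only shows the general pattern, assuming the server rejects any write whose resourceVersion is not the latest, and that the writer recovers by re-reading the object and applying its change again.

```python
import copy


class ConflictError(Exception):
    """Raised when a write carries a stale resourceVersion (toy stand-in for HTTP 409)."""


class FakeApiServer:
    """Toy in-memory store mimicking resourceVersion checks. Not the real Kubernetes API."""

    def __init__(self):
        self._objects = {}

    def create(self, name, obj):
        stored = copy.deepcopy(obj)
        stored["resourceVersion"] = 1
        self._objects[name] = stored

    def get(self, name):
        # Return a copy so callers hold a snapshot, like a client-side cache would.
        return copy.deepcopy(self._objects[name])

    def update(self, name, obj):
        current = self._objects[name]
        # Optimistic concurrency: reject writes based on an outdated snapshot.
        if obj["resourceVersion"] != current["resourceVersion"]:
            raise ConflictError(
                f"Operation cannot be fulfilled on pods {name}: "
                "the object has been modified"
            )
        stored = copy.deepcopy(obj)
        stored["resourceVersion"] = current["resourceVersion"] + 1
        self._objects[name] = stored
        return copy.deepcopy(stored)


def update_with_retry(server, name, mutate, max_attempts=5):
    """Re-read the latest object and re-apply `mutate` until the write is accepted."""
    for attempt in range(1, max_attempts + 1):
        obj = server.get(name)  # always start from the latest version
        mutate(obj)
        try:
            return server.update(name, obj), attempt
        except ConflictError:
            if attempt == max_attempts:
                raise


server = FakeApiServer()
server.create("pod-a", {"labels": {}})

# A worker takes a snapshot, then another writer updates the same pod.
stale = server.get("pod-a")
other = server.get("pod-a")
other["labels"]["phase"] = "running"
server.update("pod-a", other)

# Writing from the stale snapshot is rejected.
stale["labels"]["owner"] = "worker"
try:
    server.update("pod-a", stale)
    conflict_seen = False
except ConflictError:
    conflict_seen = True

# Re-reading before writing succeeds and keeps the other writer's change.
result, attempts = update_with_retry(
    server, "pod-a", lambda o: o["labels"].update(owner="worker")
)
```

In this model the retry succeeds on the first attempt because nothing else writes in between. Under heavy parallel load the same loop may need several attempts, which is why conflicts tend to show up more often as the number of concurrent tasks grows.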

freezing-airport-6809

06/14/2024, 5:37 AM

so it is indeed coming from the auth system
