# ask-the-community
Hi all, We are using flyte-binary on EKS. When we run largish (~600 task) parallelized workflows, many tasks fail with what seems like a k8s versioning error: "Operation cannot be fulfilled on pods <pod_name>: the object has been modified; please apply your changes to the latest version and try again". The workflow is a dynamic workflow which spins off the large number of tasks. Looks like some kind of race condition when modifying the pod state, any idea why?
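For context, that message is Kubernetes' optimistic concurrency check: every object carries a `resourceVersion`, and a write that carries a stale version is rejected (HTTP 409 Conflict). A toy in-memory sketch of the mechanism (not the real k8s API, just an illustration of the race):

```python
class Conflict(Exception):
    """Stand-in for the k8s apiserver's 409 Conflict response."""

class ObjectStore:
    """Toy model of optimistic concurrency on a single object."""
    def __init__(self):
        self.resource_version = 1
        self.data = {}

    def get(self):
        # A read returns the object together with its current resourceVersion.
        return self.resource_version, dict(self.data)

    def update(self, seen_version, new_data):
        # A write carrying a stale resourceVersion is rejected, which is what
        # "the object has been modified; please apply your changes to the
        # latest version and try again" reports.
        if seen_version != self.resource_version:
            raise Conflict("the object has been modified")
        self.data = new_data
        self.resource_version += 1

store = ObjectStore()
v_a, _ = store.get()                     # controller A reads version 1
v_b, _ = store.get()                     # controller B reads version 1
store.update(v_b, {"phase": "Running"})  # B writes first, version becomes 2
try:
    store.update(v_a, {"phase": "Succeeded"})  # A's version is now stale
except Conflict as exc:
    print(f"conflict: {exc}")
```

With hundreds of pods being updated concurrently by the propeller workers and other controllers (e.g. the kubelet), occasional conflicts like this are expected; the question is whether they are retried cleanly.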
600 is not large. We run fanouts of 5k-ish tasks.
There are 2 things:
1. Did you change your resources?
2. Search for inject-finalizer in the docs
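For reference, in the flyte-binary Helm chart the finalizer is toggled through the propeller k8s plugin configuration. A sketch of the relevant values fragment (key path taken from the chart's inline-configuration convention; verify against your chart version):

```yaml
configuration:
  inline:
    plugins:
      k8s:
        # When true, Flyte adds a finalizer to task pods so they are not
        # garbage-collected until propeller has recorded their final state.
        inject-finalizer: true
```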
Hi @Ketan (kumare3), thank you for replying, what do you mean "did you change your resources"? There isn't any manual change in the resources or EKS.. Regarding inject-finalizer, I see that we did in fact have this set to "true", if we turn it off, what are the implications? The documentation is somewhat sparse, or perhaps I'm not looking in the right place.
I mean cpu / mem for the single binary
Yes, I did:
```yaml
limits:
  cpu: 4
  memory: 28Gi
requests:
  cpu: 2
  memory: 28Gi
```
Still getting "503 service unavailable" with these resources too I'm afraid 😕 Do you think these two issues are related?
hmm what, that is not right
503 seems to be coming from your proxy or router
503 is a specific error code, typically returned when a quota or request rate is exceeded
have you looked at the cpu / mem usage for the pod? It might also be the number of open threads etc
The cpu/mem looks to be OK, it's not reaching the resource limit. There's a gap in the metrics collection that's correlated with the 503 errors, which hints that metric scraping is also not getting through to the pod. The 503 error message in the logs has to do with the OAuth token, which is strange to me - isn't Flyte authenticating against its own internal server in the same pod? This is the error message:
```
E0609 08:22:20.042972 7 workers.go:103] error syncing 'flytesnacks-development/azfmfpqb26mzkk8j4xlk': Workflow[] failed. ErrorRecordingError: failed to publish event, caused by: EventSinkError: Error sending event, caused by [rpc error: code = Unauthenticated desc = transport: per-RPC creds failed due to error: failed to get token: oauth2: cannot fetch token: 503 Service Unavailable]
```
Ohh interesting. Thank you for sharing this. Do you use some oauth2 provider? I will talk to someone smarter about this
Yes, we use JumpCloud to authenticate to the console, but we haven't configured it to replace the server-to-server communication, we use the built-in flyte authentication for that.
Regarding inject-finalizer, I see that we did in fact have this set to "true", if we turn it off, what are the implications?
I think it's fine to leave it
especially if you need to handle high load in your cluster. It basically enables tighter control over the Pod lifecycle, preventing Pods from being deleted until Flyte explicitly removes the finalizer. The error message you're getting is not the typical one when those situations happen, though. Here the worker seems to find a stale ResourceVersion and is then prevented from writing changes to the resource. There are ways to handle this, some of them covered in this page. As for the error getting the token, I'm not sure if it refers to the ID Token that OIDC returns, which comes from your IdP in this case, but it's a sign of high load somewhere
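Not a fix for the root cause, but while an auth endpoint is overloaded, clients typically absorb transient 503s by retrying with jittered exponential backoff. A minimal sketch of that pattern (hypothetical helper, not Flyte's actual token client):

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for a retryable failure, e.g. a 503 from the auth endpoint."""

def fetch_token_with_backoff(fetch, attempts=5, base=0.1, cap=2.0, sleep=time.sleep):
    """Retry a token fetch on transient errors with jittered exponential backoff.

    `fetch` is any callable returning a token string; it should raise
    TransientError on retryable failures such as 503 Service Unavailable.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == attempts - 1:
                raise
            # Full jitter keeps a burst of clients from retrying in lockstep.
            delay = min(cap, base * 2 ** attempt) * random.random()
            sleep(delay)

# Simulated endpoint that returns 503 twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503 Service Unavailable")
    return "token-abc"

print(fetch_token_with_backoff(flaky_fetch, sleep=lambda s: None))  # → token-abc
```

If the IdP itself is rate-limiting, backoff only smooths the symptom; the fanout of token requests from a large dynamic workflow still needs to be addressed (e.g. via caching or quota changes).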
so it is indeed coming from the auth system