Jakub Hovorka
05/23/2024, 6:53 AM
2024/05/23 06:44:57 /flyteorg/build/datacatalog/pkg/repositories/handle.go:79
[2.518ms] [rows:1] SELECT count(*) FROM pg_indexes WHERE tablename = 'artifacts' AND indexname = 'artifacts_dataset_uuid_idx' AND schemaname = CURRENT_SCHEMA()
{"metrics-prefix":"flyte:","certDir":"/var/run/flyte/certs","localCert":true,"listenPort":9443,"serviceName":"flyte-dev-flyte-binary-webhook","servicePort":443,"secretName":"flyte-dev-flyte-binary-webhook-secret","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"gcpSecretManager":{"sidecarImage":"gcr.io/google.com/cloudsdktool/cloud-sdk:alpine","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2","annotations":null}}
Snippet from kubectl describe
Containers:
  flyte:
    Container ID:   containerd://3e4eff2d1ae0be9928d7507b421c1105056fa94e833a690c81aa1bcf7ec7ae3f
    Image:          cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0
    Image ID:       cr.flyte.org/flyteorg/flyte-binary-release@sha256:896c0699d47a226ea31fb113fe40ec4f1ffe8ddebca358496022968230cad9e6
    Ports:          8088/TCP, 8089/TCP, 9443/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    Args:
      start
      --config
      /etc/flyte/config.d/*.yaml
    State:          Running
      Started:      Thu, 23 May 2024 08:51:18 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 23 May 2024 08:44:56 +0200
      Finished:     Thu, 23 May 2024 08:46:16 +0200
    Ready:          False
    Restart Count:  18
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     100m
      memory:  500Mi
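For context, exit code 137 means the container was terminated by SIGKILL: Kubernetes reports fatal signals as 128 plus the signal number, and SIGKILL is signal 9. It is typically sent by the kubelet after repeated liveness probe failures, or by the kernel OOM killer. The code can be decoded from any shell:

```shell
# Exit codes above 128 encode "128 + fatal signal number".
# 137 - 128 = 9; kill -l maps the number back to a signal name.
kill -l $((137 - 128))   # prints KILL
```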
And these are my Helm values
flyte-binary:
  deployment:
    resources:
      requests:
        cpu: 100m
        memory: 500Mi
      limits:
        cpu: 1
        memory: 2Gi
  configuration:
    inline:
      task_resources:
        defaults:
          cpu: 100m
          memory: 100Mi
          storage: 100Mi
        limits:
          memory: 1Gi
    storage:
      metadataContainer: "flyte"
      userDataContainer: "flyte"
      providerConfig:
        s3:
          # v2Signing Flag to sign requests with v2 signature
          # Useful for s3-compatible blob stores (e.g. minio)
          v2Signing: true
          # authType Type of authentication to use for connecting to S3-compatible service (Supported values: iam, accesskey)
          authType: "accesskey"
    auth:
      enabled: true
      oidc:
        baseUrl: https://login.microsoftonline.com/xxx/oauth2/v2.0/authorize
        clientId: xxx
  ingress:
    create: true
    # commonAnnotations Add common annotations to all ingress resources
    commonAnnotations:
      cert-manager.io/cluster-issuer: letsencrypt-dns01
    # httpAnnotations Add annotations to http ingress resource
    httpAnnotations: {}
    # grpcAnnotations Add annotations to grpc ingress resource
    grpcAnnotations: {}
    ingressClassName: "nginx-internal"
On top of these, I also apply other values that are environment-specific:
flyte-binary:
  configuration:
    database:
      username: flyte_dev
      host: xxx
      dbname: flyte_dev
    storage:
      providerConfig:
        s3:
          endpoint: "minio-api.xxx.cloud"
          accessKey: "xxx"
    auth:
      authorizedUris:
        - https://flyte.xxx.cloud
  ingress:
    host: "flyte.xxx.cloud"
    tls:
      - hosts:
          - flyte.xxx.cloud
        secretName: flyte-dev-app-certificate
And
flyte-binary:
  configuration:
    database:
      password: "xxx"
    storage:
      providerConfig:
        s3:
          secretKey: "xxx"
    auth:
      oidc:
        clientSecret: xxx
      internal:
        clientSecret: 'xxx'
        clientSecretHash: "xxx"
Can anyone assist me with this? Thank you very much!
Jason Parraga
05/23/2024, 3:18 PM
Jakub Hovorka
05/23/2024, 3:19 PM
Jason Parraga
05/23/2024, 3:21 PM
Jakub Hovorka
05/23/2024, 3:22 PM
Jason Parraga
05/23/2024, 5:25 PM
Jakub Hovorka
05/24/2024, 5:34 AM
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m default-scheduler Successfully assigned devops-dev/flyte-nl-main-dev-dev-flyte-binary-86989b4b7b-7h2zg to aks-nodepool-18165302-vmss0000hg
Normal Pulling 8m kubelet Pulling image "postgres:15-alpine"
Normal Pulled 7m54s kubelet Successfully pulled image "postgres:15-alpine" in 6.061s (6.061s including waiting)
Normal Created 7m54s kubelet Created container wait-for-db
Normal Started 7m54s kubelet Started container wait-for-db
Normal Pulling 7m51s kubelet Pulling image "cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0"
Normal Pulled 7m47s kubelet Successfully pulled image "cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0" in 4.747s (4.747s including waiting)
Normal Created 7m47s kubelet Created container gen-admin-auth-secret
Normal Started 7m47s kubelet Started container gen-admin-auth-secret
Warning Unhealthy 6m51s (x3 over 7m11s) kubelet Liveness probe failed: Get "http://10.64.10.92:8088/healthcheck": dial tcp 10.64.10.92:8088: connect: connection refused
Normal Killing 6m51s kubelet Container flyte failed liveness probe, will be restarted
Normal Pulled 6m21s (x2 over 7m43s) kubelet Container image "cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0" already present on machine
Normal Created 6m21s (x2 over 7m43s) kubelet Created container flyte
Normal Started 6m20s (x2 over 7m43s) kubelet Started container flyte
Warning Unhealthy 2m41s (x22 over 7m11s) kubelet Readiness probe failed: Get "http://10.64.10.92:8088/healthcheck": dial tcp 10.64.10.92:8088: connect: connection refused
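The events above show the liveness probe killing the container before initialization (including database migrations) finishes. The probe fields themselves are standard Kubernetes; a sketch of what a more forgiving configuration on the flyte container would look like follows — how (and whether) the flyte-binary chart exposes these fields depends on the chart version, so the placement is an assumption:

```yaml
# Sketch of probe tuning on the flyte container spec. The field names are
# standard Kubernetes; exposing them through the chart is an assumption.
livenessProbe:
  httpGet:
    path: /healthcheck
    port: 8088
  initialDelaySeconds: 120   # give the binary time to finish migrations/startup
  periodSeconds: 10
  failureThreshold: 6        # roughly one extra minute of grace after the delay
readinessProbe:
  httpGet:
    path: /healthcheck
    port: 8088
  initialDelaySeconds: 30
  periodSeconds: 10
```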
And these are the Container Statuses
containerStatuses:
- containerID: containerd://47e2d84bdb012da83b2e5639293bcf3a90f91cd16a3340846865211baa146660
  image: cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0
  imageID: cr.flyte.org/flyteorg/flyte-binary-release@sha256:896c0699d47a226ea31fb113fe40ec4f1ffe8ddebca358496022968230cad9e6
  lastState:
    terminated:
      containerID: containerd://849c13aa05c244f12a8c0f677f49e1c481f9f5012d0b548522ea8282ccc31d60
      exitCode: 137
      finishedAt: "2024-05-24T05:32:47Z"
      reason: Error
      startedAt: "2024-05-24T05:31:18Z"
  name: flyte
  ready: false
  restartCount: 6
  started: true
  state:
    running:
      startedAt: "2024-05-24T05:32:48Z"
hostIP: 10.41.0.11
initContainerStatuses:
- containerID: containerd://9538ffbd902f3c5e90df2c2d30dc5be64918c95d3bba9b70d086812372b61f5d
  image: docker.io/library/postgres:15-alpine
  imageID: docker.io/library/postgres@sha256:0cec11eaf51a9af24c27a09cae9840a9234336e5bf9edc5fdf67b3174ba05210
  lastState: {}
  name: wait-for-db
  ready: true
  restartCount: 0
  started: false
  state:
    terminated:
      containerID: containerd://9538ffbd902f3c5e90df2c2d30dc5be64918c95d3bba9b70d086812372b61f5d
      exitCode: 0
      finishedAt: "2024-05-24T05:23:44Z"
      reason: Completed
      startedAt: "2024-05-24T05:23:44Z"
- containerID: containerd://11511596ee8c8655201ad3e6f3e1c3218f8123eaf68ffc4b01ccc31f1ecb0c96
  image: cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0
  imageID: cr.flyte.org/flyteorg/flyte-binary-release@sha256:896c0699d47a226ea31fb113fe40ec4f1ffe8ddebca358496022968230cad9e6
  lastState: {}
  name: gen-admin-auth-secret
  ready: true
  restartCount: 0
  started: false
  state:
    terminated:
      containerID: containerd://11511596ee8c8655201ad3e6f3e1c3218f8123eaf68ffc4b01ccc31f1ecb0c96
      exitCode: 0
      finishedAt: "2024-05-24T05:23:54Z"
      reason: Completed
      startedAt: "2024-05-24T05:23:51Z"
But I only see the 137 error.
Edit: I tried extending the probe's initial delay to 5 minutes, and the pod is still stuck, with this as the last record in the log:
2024/05/24 05:36:20 /flyteorg/build/datacatalog/pkg/repositories/handle.go:79
[1.147ms] [rows:1] SELECT count(*) FROM pg_indexes WHERE tablename = 'artifacts' AND indexname = 'artifacts_dataset_uuid_idx' AND schemaname = CURRENT_SCHEMA()
{"metrics-prefix":"flyte:","certDir":"/var/run/flyte/certs","localCert":true,"listenPort":9443,"serviceName":"flyte-nl-main-dev-dev-flyte-binary-webhook","servicePort":443,"secretName":"flyte-nl-main-dev-dev-flyte-binary-webhook-secret","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"gcpSecretManager":{"sidecarImage":"gcr.io/google.com/cloudsdktool/cloud-sdk:alpine","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2","annotations":null}}
Is there a way to get more information about what could be wrong?
Jason Parraga
05/24/2024, 5:58 AM
Jason Parraga
05/24/2024, 6:05 AM
Jakub Hovorka
05/24/2024, 8:43 AM
2024/05/24 08:28:26 /flyteorg/build/flyteadmin/pkg/repositories/gormimpl/project_repo.go:78
[1.853ms] [rows:1] SELECT * FROM "projects" WHERE state <> 1 ORDER BY created_at desc
2024/05/24 08:28:26 /flyteorg/build/flyteadmin/pkg/repositories/gormimpl/resource_repo.go:118
[0.467ms] [rows:0] SELECT * FROM "resources" WHERE resource_type = 'CLUSTER_RESOURCE' AND domain IN ('','development') AND project IN ('','flytesnacks') AND workflow IN ('') AND launch_plan IN ('') ORDER BY priority desc,"resources"."id" LIMIT 1
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-development]: ignoring unrecognized filetype [..2024_05_24_08_24_35.1348902394]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-development]: ignoring unrecognized filetype [..data]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:477"},"level":"debug","msg":"successfully read template config file [namespace.yaml]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:329"},"level":"debug","msg":"Attempting to create resource [Namespace] in cluster [] for namespace [flytesnacks-development]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:337"},"level":"debug","msg":"Type [Namespace] in namespace [flytesnacks-development] already exists - attempting update instead","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:372"},"level":"info","msg":"Resource [Namespace] in namespace [flytesnacks-development] is not modified","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:607"},"level":"debug","msg":"Successfully created kubernetes resources for [flytesnacks-development]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:611"},"level":"info","msg":"Completed cluster resource creation loop for namespace [flytesnacks-development] with stats: [{Created:0 Updated:0 AlreadyThere:1 Errored:0}]","ts":"2024-05-24T08:28:26Z"}
2024/05/24 08:28:26 /flyteorg/build/flyteadmin/pkg/repositories/gormimpl/resource_repo.go:118
[0.747ms] [rows:0] SELECT * FROM "resources" WHERE resource_type = 'CLUSTER_RESOURCE' AND domain IN ('','staging') AND project IN ('','flytesnacks') AND workflow IN ('') AND launch_plan IN ('') ORDER BY priority desc,"resources"."id" LIMIT 1
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-staging]: ignoring unrecognized filetype [..2024_05_24_08_24_35.1348902394]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-staging]: ignoring unrecognized filetype [..data]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:477"},"level":"debug","msg":"successfully read template config file [namespace.yaml]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:329"},"level":"debug","msg":"Attempting to create resource [Namespace] in cluster [] for namespace [flytesnacks-staging]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:337"},"level":"debug","msg":"Type [Namespace] in namespace [flytesnacks-staging] already exists - attempting update instead","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:372"},"level":"info","msg":"Resource [Namespace] in namespace [flytesnacks-staging] is not modified","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:607"},"level":"debug","msg":"Successfully created kubernetes resources for [flytesnacks-staging]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:611"},"level":"info","msg":"Completed cluster resource creation loop for namespace [flytesnacks-staging] with stats: [{Created:0 Updated:0 AlreadyThere:1 Errored:0}]","ts":"2024-05-24T08:28:26Z"}
2024/05/24 08:28:26 /flyteorg/build/flyteadmin/pkg/repositories/gormimpl/resource_repo.go:118
[0.788ms] [rows:0] SELECT * FROM "resources" WHERE resource_type = 'CLUSTER_RESOURCE' AND domain IN ('','production') AND project IN ('','flytesnacks') AND workflow IN ('') AND launch_plan IN ('') ORDER BY priority desc,"resources"."id" LIMIT 1
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-production]: ignoring unrecognized filetype [..2024_05_24_08_24_35.1348902394]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-production]: ignoring unrecognized filetype [..data]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:477"},"level":"debug","msg":"successfully read template config file [namespace.yaml]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:329"},"level":"debug","msg":"Attempting to create resource [Namespace] in cluster [] for namespace [flytesnacks-production]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:337"},"level":"debug","msg":"Type [Namespace] in namespace [flytesnacks-production] already exists - attempting update instead","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:372"},"level":"info","msg":"Resource [Namespace] in namespace [flytesnacks-production] is not modified","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:607"},"level":"debug","msg":"Successfully created kubernetes resources for [flytesnacks-production]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:611"},"level":"info","msg":"Completed cluster resource creation loop for namespace [flytesnacks-production] with stats: [{Created:0 Updated:0 AlreadyThere:1 Errored:0}]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:615"},"level":"info","msg":"Completed cluster resource creation loop with stats: [{Created:0 Updated:0 AlreadyThere:3 Errored:0}]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:633"},"level":"info","msg":"Successfully completed cluster resource creation loop","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:28Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:28Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:28Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:29Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:29Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:29Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:30Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:30Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:30Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:31Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:31Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:31Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:32Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:32Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:32Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:33Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:33Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:33Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:34Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:34Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:34Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:35Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:35Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:35Z"}
{"json":{"src":"execution_stats.go:63"},"level":"debug","msg":"Execution stats: ActiveExecutions: 0 ActiveNodes: 0, ActiveTasks: 0","ts":"2024-05-24T08:28:36Z"}
I think the fact that the probes are failing is just because the main container isn't able to go through the initialization process. And the issue is that, besides the exit code 137 and these logs, there isn't anything I could use for debugging. Also, looking into Grafana, the CPU and memory usage is even under the requests that I have set, so it does not look like it lacks resources.
David Espejo (he/him)
05/24/2024, 2:24 PM
limits? By default Flyte will make resource requests = limits. So setting limits is generally not very helpful for the K8s scheduler when it allocates resources; this is especially true with CPUs, though.
Jason Parraga
05/24/2024, 3:18 PM
David Espejo (he/him)
05/24/2024, 3:18 PM
> Wouldn't that only be relevant for tasks, not the control/data plane?
Yeah, I was about to correct my original statement 😅
Jakub Hovorka
06/03/2024, 7:03 AM
{
  "json": {
    "src": "service.go:347"
  },
  "level": "error",
  "msg": "Error creating auth context [AUTH_CONTEXT_SETUP_FAILED] Error creating oidc provider w/ issuer [https://login.microsoftonline.com/<redacted>/oauth2/v2.0/authorize], caused by: 404 Not Found: ",
  "ts": "2024-06-03T07:10:53Z"
}
Edit 2: Okay, this was a silly mistake on my side. I had set the baseUrl to
https://login.microsoftonline.com/<redacted>/oauth2/v2.0/authorize
while it was supposed to be set to
https://login.microsoftonline.com/<redacted>/v2.0
I have no idea why I used the first URL, because you mention the correct URL in your docs, so I probably copied it from Azure without thinking about it.
The issue is fixed, thanks a lot for your time!
David Espejo (he/him)
06/04/2024, 1:47 PM
> I have no idea why did I use the first URL because you are mentioning the correct URL in your docs
This was fixed recently; the docs actually had the URL wrong, so you got it from there. Apologies. I'm glad it's working now!
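For anyone hitting the same 404: the baseUrl must be the OIDC issuer URL, which the OIDC client library extends with /.well-known/openid-configuration to fetch the provider metadata; pointing it at the /authorize endpoint makes that discovery request return 404, as in the error above. A corrected fragment of the values shown earlier (the tenant ID is a placeholder):

```yaml
flyte-binary:
  configuration:
    auth:
      oidc:
        # The issuer URL, not the authorize endpoint; OIDC discovery appends
        # /.well-known/openid-configuration to this value.
        baseUrl: https://login.microsoftonline.com/<tenant-id>/v2.0
        clientId: xxx
```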