Jakub Hovorka
05/23/2024, 6:53 AM
2024/05/23 06:44:57 /flyteorg/build/datacatalog/pkg/repositories/handle.go:79
[2.518ms] [rows:1] SELECT count(*) FROM pg_indexes WHERE tablename = 'artifacts' AND indexname = 'artifacts_dataset_uuid_idx' AND schemaname = CURRENT_SCHEMA()
{"metrics-prefix":"flyte:","certDir":"/var/run/flyte/certs","localCert":true,"listenPort":9443,"serviceName":"flyte-dev-flyte-binary-webhook","servicePort":443,"secretName":"flyte-dev-flyte-binary-webhook-secret","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"gcpSecretManager":{"sidecarImage":"gcr.io/google.com/cloudsdktool/cloud-sdk:alpine","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2","annotations":null}}
Snippet from kubectl describe
Containers:
  flyte:
    Container ID:   containerd://3e4eff2d1ae0be9928d7507b421c1105056fa94e833a690c81aa1bcf7ec7ae3f
    Image:          cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0
    Image ID:       cr.flyte.org/flyteorg/flyte-binary-release@sha256:896c0699d47a226ea31fb113fe40ec4f1ffe8ddebca358496022968230cad9e6
    Ports:          8088/TCP, 8089/TCP, 9443/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    Args:
      start
      --config
      /etc/flyte/config.d/*.yaml
    State:          Running
      Started:      Thu, 23 May 2024 08:51:18 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 23 May 2024 08:44:56 +0200
      Finished:     Thu, 23 May 2024 08:46:16 +0200
    Ready:          False
    Restart Count:  18
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     100m
      memory:  500Mi
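For context, exit code 137 means the container was terminated by SIGKILL: Kubernetes reports fatal signals as 128 plus the signal number, and SIGKILL is signal 9. It is typically sent by the kubelet after repeated liveness probe failures, or by the kernel OOM killer. The code can be decoded from any shell:

```shell
# Exit codes above 128 encode "128 + fatal signal number".
# 137 - 128 = 9; kill -l maps the number back to a signal name.
kill -l $((137 - 128))   # prints KILL
```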
And these are my Helm values
flyte-binary:
  deployment:
    resources:
      requests:
        cpu: 100m
        memory: 500Mi
      limits:
        cpu: 1
        memory: 2Gi
  configuration:
    inline:
      task_resources:
        defaults:
          cpu: 100m
          memory: 100Mi
          storage: 100Mi
        limits:
          memory: 1Gi
    storage:
      metadataContainer: "flyte"
      userDataContainer: "flyte"
      providerConfig:
        s3:
          # v2Signing Flag to sign requests with v2 signature
          # Useful for s3-compatible blob stores (e.g. minio)
          v2Signing: true
          # authType Type of authentication to use for connecting to S3-compatible service (Supported values: iam, accesskey)
          authType: "accesskey"
    auth:
      enabled: true
      oidc:
        baseUrl: https://login.microsoftonline.com/xxx/oauth2/v2.0/authorize
        clientId: xxx
  ingress:
    create: true
    # commonAnnotations Add common annotations to all ingress resources
    commonAnnotations:
      cert-manager.io/cluster-issuer: letsencrypt-dns01
    # httpAnnotations Add annotations to http ingress resource
    httpAnnotations: {}
    # grpcAnnotations Add annotations to grpc ingress resource
    grpcAnnotations: {}
    ingressClassName: "nginx-internal"
On top of these, I also apply other values that are environment-specific:
flyte-binary:
  configuration:
    database:
      username: flyte_dev
      host: xxx
      dbname: flyte_dev
    storage:
      providerConfig:
        s3:
          endpoint: "minio-api.xxx.cloud"
          accessKey: "xxx"
    auth:
      authorizedUris:
        - https://flyte.xxx.cloud
  ingress:
    host: "flyte.xxx.cloud"
    tls:
      - hosts:
          - flyte.xxx.cloud
        secretName: flyte-dev-app-certificate
And
flyte-binary:
  configuration:
    database:
      password: "xxx"
    storage:
      providerConfig:
        s3:
          secretKey: "xxx"
    auth:
      oidc:
        clientSecret: xxx
      internal:
        clientSecret: 'xxx'
        clientSecretHash: "xxx"
Can anyone assist me with this? Thank you very much!
Jason Parraga
05/23/2024, 3:18 PM
Jakub Hovorka
05/23/2024, 3:19 PM
Jason Parraga
05/23/2024, 3:21 PM
Jakub Hovorka
05/23/2024, 3:22 PM
Jason Parraga
05/23/2024, 5:25 PM
Jakub Hovorka
05/24/2024, 5:34 AM
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m default-scheduler Successfully assigned devops-dev/flyte-nl-main-dev-dev-flyte-binary-86989b4b7b-7h2zg to aks-nodepool-18165302-vmss0000hg
Normal Pulling 8m kubelet Pulling image "postgres:15-alpine"
Normal Pulled 7m54s kubelet Successfully pulled image "postgres:15-alpine" in 6.061s (6.061s including waiting)
Normal Created 7m54s kubelet Created container wait-for-db
Normal Started 7m54s kubelet Started container wait-for-db
Normal Pulling 7m51s kubelet Pulling image "cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0"
Normal Pulled 7m47s kubelet Successfully pulled image "cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0" in 4.747s (4.747s including waiting)
Normal Created 7m47s kubelet Created container gen-admin-auth-secret
Normal Started 7m47s kubelet Started container gen-admin-auth-secret
Warning Unhealthy 6m51s (x3 over 7m11s) kubelet Liveness probe failed: Get "http://10.64.10.92:8088/healthcheck": dial tcp 10.64.10.92:8088: connect: connection refused
Normal Killing 6m51s kubelet Container flyte failed liveness probe, will be restarted
Normal Pulled 6m21s (x2 over 7m43s) kubelet Container image "cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0" already present on machine
Normal Created 6m21s (x2 over 7m43s) kubelet Created container flyte
Normal Started 6m20s (x2 over 7m43s) kubelet Started container flyte
Warning Unhealthy 2m41s (x22 over 7m11s) kubelet Readiness probe failed: Get "http://10.64.10.92:8088/healthcheck": dial tcp 10.64.10.92:8088: connect: connection refused
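The events above show the liveness probe killing the container before initialization (including database migrations) finishes. The probe fields themselves are standard Kubernetes; a sketch of what a more forgiving configuration on the flyte container would look like follows — how (and whether) the flyte-binary chart exposes these fields depends on the chart version, so the placement is an assumption:

```yaml
# Sketch of probe tuning on the flyte container spec. The field names are
# standard Kubernetes; exposing them through the chart is an assumption.
livenessProbe:
  httpGet:
    path: /healthcheck
    port: 8088
  initialDelaySeconds: 120   # give the binary time to finish migrations/startup
  periodSeconds: 10
  failureThreshold: 6        # roughly one extra minute of grace after the delay
readinessProbe:
  httpGet:
    path: /healthcheck
    port: 8088
  initialDelaySeconds: 30
  periodSeconds: 10
```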
And these are the Container Statuses
containerStatuses:
- containerID: containerd://47e2d84bdb012da83b2e5639293bcf3a90f91cd16a3340846865211baa146660
  image: cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0
  imageID: cr.flyte.org/flyteorg/flyte-binary-release@sha256:896c0699d47a226ea31fb113fe40ec4f1ffe8ddebca358496022968230cad9e6
  lastState:
    terminated:
      containerID: containerd://849c13aa05c244f12a8c0f677f49e1c481f9f5012d0b548522ea8282ccc31d60
      exitCode: 137
      finishedAt: "2024-05-24T05:32:47Z"
      reason: Error
      startedAt: "2024-05-24T05:31:18Z"
  name: flyte
  ready: false
  restartCount: 6
  started: true
  state:
    running:
      startedAt: "2024-05-24T05:32:48Z"
hostIP: 10.41.0.11
initContainerStatuses:
- containerID: containerd://9538ffbd902f3c5e90df2c2d30dc5be64918c95d3bba9b70d086812372b61f5d
  image: docker.io/library/postgres:15-alpine
  imageID: docker.io/library/postgres@sha256:0cec11eaf51a9af24c27a09cae9840a9234336e5bf9edc5fdf67b3174ba05210
  lastState: {}
  name: wait-for-db
  ready: true
  restartCount: 0
  started: false
  state:
    terminated:
      containerID: containerd://9538ffbd902f3c5e90df2c2d30dc5be64918c95d3bba9b70d086812372b61f5d
      exitCode: 0
      finishedAt: "2024-05-24T05:23:44Z"
      reason: Completed
      startedAt: "2024-05-24T05:23:44Z"
- containerID: containerd://11511596ee8c8655201ad3e6f3e1c3218f8123eaf68ffc4b01ccc31f1ecb0c96
  image: cr.flyte.org/flyteorg/flyte-binary-release:v1.12.0
  imageID: cr.flyte.org/flyteorg/flyte-binary-release@sha256:896c0699d47a226ea31fb113fe40ec4f1ffe8ddebca358496022968230cad9e6
  lastState: {}
  name: gen-admin-auth-secret
  ready: true
  restartCount: 0
  started: false
  state:
    terminated:
      containerID: containerd://11511596ee8c8655201ad3e6f3e1c3218f8123eaf68ffc4b01ccc31f1ecb0c96
      exitCode: 0
      finishedAt: "2024-05-24T05:23:54Z"
      reason: Completed
      startedAt: "2024-05-24T05:23:51Z"
But I only see the 137 error.
Edit: I tried extending the probe's initial delay to 5 minutes, and the pod is still stuck, with this as the last record in the log:
2024/05/24 05:36:20 /flyteorg/build/datacatalog/pkg/repositories/handle.go:79
[1.147ms] [rows:1] SELECT count(*) FROM pg_indexes WHERE tablename = 'artifacts' AND indexname = 'artifacts_dataset_uuid_idx' AND schemaname = CURRENT_SCHEMA()
{"metrics-prefix":"flyte:","certDir":"/var/run/flyte/certs","localCert":true,"listenPort":9443,"serviceName":"flyte-nl-main-dev-dev-flyte-binary-webhook","servicePort":443,"secretName":"flyte-nl-main-dev-dev-flyte-binary-webhook-secret","secretManagerType":"K8s","awsSecretManager":{"sidecarImage":"docker.io/amazon/aws-secrets-manager-secret-sidecar:v0.1.4","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"gcpSecretManager":{"sidecarImage":"gcr.io/google.com/cloudsdktool/cloud-sdk:alpine","resources":{"limits":{"cpu":"200m","memory":"500Mi"},"requests":{"cpu":"200m","memory":"500Mi"}}},"vaultSecretManager":{"role":"flyte","kvVersion":"2","annotations":null}}
Is there a way to get more information about what could be wrong?
Jason Parraga
05/24/2024, 5:58 AM
Jason Parraga
05/24/2024, 6:05 AM
Jakub Hovorka
05/24/2024, 8:43 AM
2024/05/24 08:28:26 /flyteorg/build/flyteadmin/pkg/repositories/gormimpl/project_repo.go:78
[1.853ms] [rows:1] SELECT * FROM "projects" WHERE state <> 1 ORDER BY created_at desc
2024/05/24 08:28:26 /flyteorg/build/flyteadmin/pkg/repositories/gormimpl/resource_repo.go:118
[0.467ms] [rows:0] SELECT * FROM "resources" WHERE resource_type = 'CLUSTER_RESOURCE' AND domain IN ('','development') AND project IN ('','flytesnacks') AND workflow IN ('') AND launch_plan IN ('') ORDER BY priority desc,"resources"."id" LIMIT 1
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-development]: ignoring unrecognized filetype [..2024_05_24_08_24_35.1348902394]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-development]: ignoring unrecognized filetype [..data]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:477"},"level":"debug","msg":"successfully read template config file [namespace.yaml]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:329"},"level":"debug","msg":"Attempting to create resource [Namespace] in cluster [] for namespace [flytesnacks-development]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:337"},"level":"debug","msg":"Type [Namespace] in namespace [flytesnacks-development] already exists - attempting update instead","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:372"},"level":"info","msg":"Resource [Namespace] in namespace [flytesnacks-development] is not modified","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:607"},"level":"debug","msg":"Successfully created kubernetes resources for [flytesnacks-development]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:611"},"level":"info","msg":"Completed cluster resource creation loop for namespace [flytesnacks-development] with stats: [{Created:0 Updated:0 AlreadyThere:1 Errored:0}]","ts":"2024-05-24T08:28:26Z"}
2024/05/24 08:28:26 /flyteorg/build/flyteadmin/pkg/repositories/gormimpl/resource_repo.go:118
[0.747ms] [rows:0] SELECT * FROM "resources" WHERE resource_type = 'CLUSTER_RESOURCE' AND domain IN ('','staging') AND project IN ('','flytesnacks') AND workflow IN ('') AND launch_plan IN ('') ORDER BY priority desc,"resources"."id" LIMIT 1
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-staging]: ignoring unrecognized filetype [..2024_05_24_08_24_35.1348902394]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-staging]: ignoring unrecognized filetype [..data]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:477"},"level":"debug","msg":"successfully read template config file [namespace.yaml]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:329"},"level":"debug","msg":"Attempting to create resource [Namespace] in cluster [] for namespace [flytesnacks-staging]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:337"},"level":"debug","msg":"Type [Namespace] in namespace [flytesnacks-staging] already exists - attempting update instead","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:372"},"level":"info","msg":"Resource [Namespace] in namespace [flytesnacks-staging] is not modified","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:607"},"level":"debug","msg":"Successfully created kubernetes resources for [flytesnacks-staging]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:611"},"level":"info","msg":"Completed cluster resource creation loop for namespace [flytesnacks-staging] with stats: [{Created:0 Updated:0 AlreadyThere:1 Errored:0}]","ts":"2024-05-24T08:28:26Z"}
2024/05/24 08:28:26 /flyteorg/build/flyteadmin/pkg/repositories/gormimpl/resource_repo.go:118
[0.788ms] [rows:0] SELECT * FROM "resources" WHERE resource_type = 'CLUSTER_RESOURCE' AND domain IN ('','production') AND project IN ('','flytesnacks') AND workflow IN ('') AND launch_plan IN ('') ORDER BY priority desc,"resources"."id" LIMIT 1
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-production]: ignoring unrecognized filetype [..2024_05_24_08_24_35.1348902394]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:297"},"level":"debug","msg":"syncing namespace [flytesnacks-production]: ignoring unrecognized filetype [..data]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:477"},"level":"debug","msg":"successfully read template config file [namespace.yaml]","ts":"2024-05-24T08:28:26Z"}
{"json":{"src":"controller.go:329"},"level":"debug","msg":"Attempting to create resource [Namespace] in cluster [] for namespace [flytesnacks-production]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:337"},"level":"debug","msg":"Type [Namespace] in namespace [flytesnacks-production] already exists - attempting update instead","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:372"},"level":"info","msg":"Resource [Namespace] in namespace [flytesnacks-production] is not modified","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:607"},"level":"debug","msg":"Successfully created kubernetes resources for [flytesnacks-production]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:611"},"level":"info","msg":"Completed cluster resource creation loop for namespace [flytesnacks-production] with stats: [{Created:0 Updated:0 AlreadyThere:1 Errored:0}]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:615"},"level":"info","msg":"Completed cluster resource creation loop with stats: [{Created:0 Updated:0 AlreadyThere:3 Errored:0}]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"controller.go:633"},"level":"info","msg":"Successfully completed cluster resource creation loop","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:27Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:28Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:28Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:28Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:29Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:29Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:29Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:30Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:30Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:30Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:31Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:31Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:31Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:32Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:32Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:32Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:33Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:33Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:33Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:34Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:34Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:34Z"}
{"json":{"src":"composite_workqueue.go:88"},"level":"debug","msg":"Subqueue handler batch round","ts":"2024-05-24T08:28:35Z"}
{"json":{"src":"composite_workqueue.go:98"},"level":"debug","msg":"Dynamically configured batch size [-1]","ts":"2024-05-24T08:28:35Z"}
{"json":{"src":"composite_workqueue.go:129"},"level":"debug","msg":"Exiting SubQueue handler batch round","ts":"2024-05-24T08:28:35Z"}
{"json":{"src":"execution_stats.go:63"},"level":"debug","msg":"Execution stats: ActiveExecutions: 0 ActiveNodes: 0, ActiveTasks: 0","ts":"2024-05-24T08:28:36Z"}
I think the fact that the probes are failing is just because the main container isn't able to go through the initialization process. And the issue is that, besides the exit code 137 and these logs, there isn't anything I could use for debugging. Also, looking into Grafana, the CPU and memory usage is even under the requests that I have set, so it does not look like it lacks resources.
David Espejo (he/him)
05/24/2024, 2:24 PM
limits? By default Flyte will make resource requests = limits. So setting limits is generally not very helpful for the K8s scheduler when it allocates resources; this is especially true with CPUs, though.
Jason Parraga
05/24/2024, 3:18 PM
David Espejo (he/him)
05/24/2024, 3:18 PM
> Wouldn't that only be relevant for tasks, not the control/data plane?
Yeah, I was about to correct my original statement 😅
Jakub Hovorka
06/03/2024, 7:03 AM
{
  "json": {
    "src": "service.go:347"
  },
  "level": "error",
  "msg": "Error creating auth context [AUTH_CONTEXT_SETUP_FAILED] Error creating oidc provider w/ issuer [https://login.microsoftonline.com/<redacted>/oauth2/v2.0/authorize], caused by: 404 Not Found: ",
  "ts": "2024-06-03T07:10:53Z"
}
Edit 2: Okay, this was a silly mistake on my side. I had set the baseUrl to
https://login.microsoftonline.com/<redacted>/oauth2/v2.0/authorize
while it was supposed to be set to
https://login.microsoftonline.com/<redacted>/v2.0
I have no idea why I used the first URL, because you mention the correct URL in your docs, so I probably copied it from Azure without thinking about it.
The issue is fixed, thanks a lot for your time!
David Espejo (he/him)
06/04/2024, 1:47 PM
> I have no idea why did I use the first URL because you are mentioning the correct URL in your docs
This was fixed recently; the docs actually had the URL wrong, so you got it from there. Apologies. I'm glad it's working now!
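For anyone hitting the same 404: the baseUrl must be the OIDC issuer URL, which the OIDC client library extends with /.well-known/openid-configuration to fetch the provider metadata; pointing it at the /authorize endpoint makes that discovery request return 404, as in the error above. A corrected fragment of the values shown earlier (the tenant ID is a placeholder):

```yaml
flyte-binary:
  configuration:
    auth:
      oidc:
        # The issuer URL, not the authorize endpoint; OIDC discovery appends
        # /.well-known/openid-configuration to this value.
        baseUrl: https://login.microsoftonline.com/<tenant-id>/v2.0
        clientId: xxx
```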