# ask-the-community
s
Hi again everyone, @David Espejo (he/him) has been helping me with my AWS deployment of Flyte. We're nearly there, got a wf deployed, but it just hangs and doesn't complete or fail. I'm trying to execute the greeting wf (works fine in the sandbox). These are the events from the
kubectl describe pods
cmd. Anyone got any ideas?
Type  Reason   Age  From        Message
 ----  ------   ----  ----        -------
 Normal Scheduled 8m56s default-scheduler Successfully assigned flyte/flyte-backend-flyte-binary-5b978865bb-r757k to ip-10-250-3-158.eu-west-1.compute.internal
 Normal Pulled   8m56s kubelet      Container image "postgres:15-alpine" already present on machine
 Normal Created  8m56s kubelet      Created container wait-for-db
 Normal Started  8m55s kubelet      Started container wait-for-db
 Normal Pulled   8m55s kubelet      Container image "cr.flyte.org/flyteorg/flyte-binary-release:v1.4.3" already present on machine
 Normal Created  8m55s kubelet      Created container flyte
 Normal Started  8m54s kubelet      Started container flyte
The
kubectl logs
cmd doesn't contain anything useful, just a bunch of SQL commands
d
I'm able to reproduce. This is an EKS 1.25 environment, using the flyte-binary chart and IRSA. Everything from the deployment POV has been double-checked and seems to be fine
@Samuel Bentley please share output of: 1.
kubectl get po -n flytesnacks-development
2.
kubectl describe pod <pod-name> -n flytesnacks-development
s
Hi @David Espejo (he/him), here you go...
% kubectl get po -n flytesnacks-development
NAME            READY  STATUS  RESTARTS  AGE
f71ae81256b5d437aa84-n0-0  0/1   Pending  0     16h
For point 2, I get
% kubectl describe pod flyte-backend-flyte-binary-5b978865bb-r757k -n flytesnacks-development
Error from server (NotFound): pods "flyte-backend-flyte-binary-5b978865bb-r757k" not found
Unless you meant
-n flyte
% kubectl describe pod flyte-backend-flyte-binary-5b978865bb-r757k -n flyte          
Name:       flyte-backend-flyte-binary-5b978865bb-r757k
Namespace:    flyte
Priority:     0
Service Account: flyte-backend-flyte-binary
Node:       ip-10-250-3-158.eu-west-1.compute.internal/10.250.3.158
Start Time:    Wed, 26 Apr 2023 17:14:46 +0100
Labels:      app.kubernetes.io/instance=flyte-backend
         app.kubernetes.io/name=flyte-binary
         pod-template-hash=5b978865bb
Annotations:   checksum/cluster-resource-templates: 9dc51bb64ed68c61ffa4f5dff19868785a31b86610f901b5f26faa9c6287c802
         checksum/configuration: e010fe642ccfa3ec064101401ceeaa0535ebe141ebb4be89156dd2d18a54a6cd
         checksum/db-password-secret: 9567e3f44e773036cc72041284c2418d14c9fdf05a35ba61184f95cd2a6bae4f
Status:      Running
IP:        10.250.3.136
IPs:
 IP:      10.250.3.136
Controlled By: ReplicaSet/flyte-backend-flyte-binary-5b978865bb
Init Containers:
 wait-for-db:
  Container ID: containerd://8a9c1652eb42cb2e686f6991e7fb3d12af8b428542cff114beec519c23a99d93
  Image:     postgres:15-alpine
  Image ID:   docker.io/library/postgres@sha256:0ce2f7c363133126dbcb1d3409dc523ecd243c55ca95a19b9b3c73c31c670b4a
  Port:     <none>
  Host Port:   <none>
  Command:
   sh
   -ec
  Args:
   until pg_isready \
    -h flyteadmin.cluster-cqyalto63rvx.eu-west-1.rds.amazonaws.com \
    -p 5432 \
    -U flyteadmin
   do
    echo waiting for database
    sleep 0.1
   done
    
  State:     Terminated
   Reason:    Completed
   Exit Code:  0
   Started:   Wed, 26 Apr 2023 17:14:47 +0100
   Finished:   Wed, 26 Apr 2023 17:14:47 +0100
  Ready:     True
  Restart Count: 0
  Environment:
   AWS_STS_REGIONAL_ENDPOINTS:  regional
   AWS_DEFAULT_REGION:      eu-west-1
   AWS_REGION:          eu-west-1
   AWS_ROLE_ARN:         arn:aws:iam::276767680874:role/flyte-system-role
   AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  Mounts:
   /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
   /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ss8jm (ro)
Containers:
 flyte:
  Container ID: containerd://c484d9f5e186a6acb3fc33aabb2fd810c6de315ea6b1eb527c04833cb6a72fe6
  Image:     cr.flyte.org/flyteorg/flyte-binary-release:v1.4.3
  Image ID:   cr.flyte.org/flyteorg/flyte-binary-release@sha256:eee17e38ba877bd034b7fd271ac0799e94560470f2e672ef4293fa5ab8a75d99
  Ports:     8088/TCP, 8089/TCP, 9443/TCP
  Host Ports:  0/TCP, 0/TCP, 0/TCP
  Args:
   start
   --config
   /etc/flyte/config.d/*.yaml
  State:     Running
   Started:   Wed, 26 Apr 2023 17:14:48 +0100
  Ready:     True
  Restart Count: 0
  Liveness:    http-get http://:http/healthcheck delay=0s timeout=1s period=10s #success=1 #failure=3
  Readiness:   http-get http://:http/healthcheck delay=0s timeout=1s period=10s #success=1 #failure=3
  Environment:
   POD_NAME:           flyte-backend-flyte-binary-5b978865bb-r757k (v1:metadata.name)
   POD_NAMESPACE:        flyte (v1:metadata.namespace)
   AWS_STS_REGIONAL_ENDPOINTS:  regional
   AWS_DEFAULT_REGION:      eu-west-1
   AWS_REGION:          eu-west-1
   AWS_ROLE_ARN:         arn:aws:iam::276767680874:role/flyte-system-role
   AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
  Mounts:
   /etc/flyte/cluster-resource-templates from cluster-resource-templates (rw)
   /etc/flyte/config.d from config (rw)
   /var/run/flyte from state (rw)
   /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
   /var/run/secrets/flyte/db-pass from db-pass (rw,path="db-pass")
   /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ss8jm (ro)
Conditions:
 Type       Status
 Initialized    True 
 Ready       True 
 ContainersReady  True 
 PodScheduled   True 
Volumes:
 aws-iam-token:
  Type:          Projected (a volume that contains injected data from multiple sources)
  TokenExpirationSeconds: 86400
 cluster-resource-templates:
  Type:   ConfigMap (a volume populated by a ConfigMap)
  Name:   flyte-backend-flyte-binary-cluster-resource-templates
  Optional: false
 config:
  Type:   ConfigMap (a volume populated by a ConfigMap)
  Name:   flyte-backend-flyte-binary-config
  Optional: false
 db-pass:
  Type:    Secret (a volume populated by a Secret)
  SecretName: flyte-backend-flyte-binary-db-pass
  Optional:  false
 state:
  Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
  Medium:   
  SizeLimit: <unset>
 kube-api-access-ss8jm:
  Type:          Projected (a volume that contains injected data from multiple sources)
  TokenExpirationSeconds: 3607
  ConfigMapName:      kube-root-ca.crt
  ConfigMapOptional:    <nil>
  DownwardAPI:       true
QoS Class:          BestEffort
Node-Selectors:       <none>
Tolerations:         node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
               node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:           <none>
d
Thank you. Please review point 2: the ask is to describe the pod resulting from point 1 (the actual workflow pod instead of the binary).
s
Ah, got it
% kubectl describe pod f71ae81256b5d437aa84-n0-0 -n flytesnacks-development
Name:       f71ae81256b5d437aa84-n0-0
Namespace:    flytesnacks-development
Priority:     0
Service Account: default
Node:       <none>
Labels:      domain=development
         execution-id=f71ae81256b5d437aa84
         interruptible=false
         node-id=n0
         project=flytesnacks
         shard-key=0
         task-name=workflows-greeting-say-hello
         workflow-name=workflows-greeting-wf
Annotations:   cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:      Pending
IP:        
IPs:       <none>
Controlled By:  flyteworkflow/f71ae81256b5d437aa84
Containers:
 f71ae81256b5d437aa84-n0-0:
  Image:   cr.flyte.org/flyteorg/flytekit:py3.9-1.5.0
  Port:    <none>
  Host Port: <none>
  Args:
   pyflyte-fast-execute
   --additional-distribution
   s3://flyte-metadata-no-opta/flytesnacks/development/SJO6COG65S5D5CY6T3KHZKGKGY======/script_mode.tar.gz
   --dest-dir
   /root
   --
   pyflyte-execute
   --inputs
   s3://flyte-metadata-no-opta/metadata/propeller/flytesnacks-development-f71ae81256b5d437aa84/n0/data/inputs.pb
   --output-prefix
   s3://flyte-metadata-no-opta/metadata/propeller/flytesnacks-development-f71ae81256b5d437aa84/n0/data/0
   --raw-output-data-prefix
   s3://flyte-metadata-no-opta/data/am/f71ae81256b5d437aa84-n0-0
   --checkpoint-path
   s3://flyte-metadata-no-opta/data/am/f71ae81256b5d437aa84-n0-0/_flytecheckpoints
   --prev-checkpoint
   ""
   --resolver
   flytekit.core.python_auto_container.default_task_resolver
   --
   task-module
   workflows.greeting
   task-name
   say_hello
  Limits:
   cpu:   2
   memory: 200Mi
  Requests:
   cpu:   2
   memory: 200Mi
  Environment:
   FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:workflows.greeting.wf
   FLYTE_INTERNAL_EXECUTION_ID:    f71ae81256b5d437aa84
   FLYTE_INTERNAL_EXECUTION_PROJECT:  flytesnacks
   FLYTE_INTERNAL_EXECUTION_DOMAIN:  development
   FLYTE_ATTEMPT_NUMBER:        0
   FLYTE_INTERNAL_TASK_PROJECT:    flytesnacks
   FLYTE_INTERNAL_TASK_DOMAIN:     development
   FLYTE_INTERNAL_TASK_NAME:      workflows.greeting.say_hello
   FLYTE_INTERNAL_TASK_VERSION:    I6N0rf5symrfjnKffYDq0g==
   FLYTE_INTERNAL_PROJECT:       flytesnacks
   FLYTE_INTERNAL_DOMAIN:       development
   FLYTE_INTERNAL_NAME:        workflows.greeting.say_hello
   FLYTE_INTERNAL_VERSION:       I6N0rf5symrfjnKffYDq0g==
   AWS_METADATA_SERVICE_TIMEOUT:    5
   AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
   FLYTE_INTERNAL_EXECUTION_WORKFLOW: flytesnacks:development:workflows.greeting.wf
   FLYTE_INTERNAL_EXECUTION_ID:    f71ae81256b5d437aa84
   FLYTE_INTERNAL_EXECUTION_PROJECT:  flytesnacks
   FLYTE_INTERNAL_EXECUTION_DOMAIN:  development
   FLYTE_ATTEMPT_NUMBER:        0
   FLYTE_INTERNAL_TASK_PROJECT:    flytesnacks
   FLYTE_INTERNAL_TASK_DOMAIN:     development
   FLYTE_INTERNAL_TASK_NAME:      workflows.greeting.say_hello
   FLYTE_INTERNAL_TASK_VERSION:    I6N0rf5symrfjnKffYDq0g==
   FLYTE_INTERNAL_PROJECT:       flytesnacks
   FLYTE_INTERNAL_DOMAIN:       development
   FLYTE_INTERNAL_NAME:        workflows.greeting.say_hello
   FLYTE_INTERNAL_VERSION:       I6N0rf5symrfjnKffYDq0g==
   AWS_METADATA_SERVICE_TIMEOUT:    5
   AWS_METADATA_SERVICE_NUM_ATTEMPTS: 20
  Mounts:
   /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r2cqx (ro)
Conditions:
 Type      Status
 PodScheduled  False 
Volumes:
 kube-api-access-r2cqx:
  Type:          Projected (a volume that contains injected data from multiple sources)
  TokenExpirationSeconds: 3607
  ConfigMapName:      kube-root-ca.crt
  ConfigMapOptional:    <nil>
  DownwardAPI:       true
QoS Class:          Guaranteed
Node-Selectors:       <none>
Tolerations:         node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
               node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
 Type   Reason      Age          From        Message
 ----   ------      ----          ----        -------
 Warning FailedScheduling 2m21s (x255 over 21h) default-scheduler 0/2 nodes are available: 2 Insufficient cpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod.
Obviously the "Insufficient cpu" jumps out, but this is the simple greeting wf, so it's surprising...
d
Thanks for sharing, Samuel. Right, by default all tasks request 2 CPUs. That can be customized on a per-task level or changed across a Flyte deployment. That said, we're changing the default task resources to require 0 CPUs and 0 memory (meaning there will be no limit) in https://github.com/flyteorg/flyteadmin/pull/530, so it should be out soon. I was able to reproduce the error using EKS nodes with 2 CPUs.
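For anyone puzzled that a 2-CPU request does not fit on a 2-CPU node: the scheduler compares the request against the node's allocatable CPU minus what other pods have already requested, not against the raw core count. Below is a rough Python sketch of that arithmetic. The allocatable and system-pod figures are hypothetical placeholders, not values taken from this cluster (check `kubectl describe node` for the real ones), and the helper names are made up for illustration.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("2", "0.5", "500m") to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def fits(request: str, allocatable: str, already_requested: str = "0") -> bool:
    """Return True if the request fits in the node's remaining allocatable CPU."""
    free = cpu_to_millicores(allocatable) - cpu_to_millicores(already_requested)
    return cpu_to_millicores(request) <= free


# Hypothetical numbers for a 2-vCPU node: part of the capacity is reserved
# for the kubelet and the OS, and system pods request some of the rest.
NODE_ALLOCATABLE = "1930m"  # assumption, not read from the cluster in this thread
SYSTEM_PODS = "350m"        # assumption, not read from the cluster in this thread

print(fits("2", NODE_ALLOCATABLE, SYSTEM_PODS))     # the 2-CPU request from the pod above
print(fits("500m", NODE_ALLOCATABLE, SYSTEM_PODS))  # a smaller per-task request
```

Under those assumed numbers the 2-CPU request cannot be placed while a 500m request can, which is consistent with the `Insufficient cpu` event above. Either lowering the task's CPU request or using larger nodes should let the pod schedule.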