So folks I have a job thats been running for the last 16 min Flyte #flyte-support

miniature-plumber-7394

04/03/2023, 11:35 PM

So folks, I have a job that's been running for the last 16 minutes on the sandbox (it typically takes around 2 min). I see that it says Parallelism on the web UI. Am I missing something here?

miniature-plumber-7394

04/03/2023, 11:48 PM

So it seems like it didn't get scheduled at all:

miniature-plumber-7394

04/03/2023, 11:48 PM

Name:             ang6jngn42zjdh69tqrp-n0-0
Namespace:        flytesnacks-development
Priority:         0
Service Account:  default
Node:             <none>
Labels:           domain=development
                  execution-id=ang6jngn42zjdh69tqrp
                  interruptible=false
                  node-id=n0
                  project=flytesnacks
                  shard-key=12
                  task-name=workflows-primary-retsynth-sample
                  workflow-name=workflows-primary-wf
Annotations:      cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    flyteworkflow/ang6jngn42zjdh69tqrp
Containers:
  ang6jngn42zjdh69tqrp-n0-0:
    Image:      localhost:30000/retsynth:be00641f3da51673c039189feb16bf269d2708a7
    Port:       <none>
    Host Port:  <none>
    Args:
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ang6jngn42zjdh69tqrp/n0/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ang6jngn42zjdh69tqrp/n0/data/0
      --raw-output-data-prefix
      s3://my-s3-bucket/data/bu/ang6jngn42zjdh69tqrp-n0-0
      --checkpoint-path
      s3://my-s3-bucket/data/bu/ang6jngn42zjdh69tqrp-n0-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      workflows.primary
      task-name
      retsynth_sample
    Limits:
      cpu:     2
      memory:  200Mi
    Requests:
      cpu:     2
      memory:  200Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:workflows.primary.wf
      FLYTE_INTERNAL_EXECUTION_ID:        ang6jngn42zjdh69tqrp
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           workflows.primary.retsynth_sample
      FLYTE_INTERNAL_TASK_VERSION:        be00641f3da51673c039189feb16bf269d2708a71
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                workflows.primary.retsynth_sample
      FLYTE_INTERNAL_VERSION:             be00641f3da51673c039189feb16bf269d2708a71
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      FLYTE_AWS_ENDPOINT:                 http://flyte-sandbox-minio.flyte:9000
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:workflows.primary.wf
      FLYTE_INTERNAL_EXECUTION_ID:        ang6jngn42zjdh69tqrp
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           workflows.primary.retsynth_sample
      FLYTE_INTERNAL_TASK_VERSION:        be00641f3da51673c039189feb16bf269d2708a71
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                workflows.primary.retsynth_sample
      FLYTE_INTERNAL_VERSION:             be00641f3da51673c039189feb16bf269d2708a71
      FLYTE_AWS_ENDPOINT:                 http://flyte-sandbox-minio.flyte:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z7crm (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-z7crm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  28m   default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  22m   default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
root@ubuntu-s-4vcpu-8gb-intel

freezing-airport-6809

04/04/2023, 12:09 AM

cc @hallowed-mouse-14616 @miniature-plumber-7394 - new PRs are merged and will be released soon; they give you a lot more visibility: 1. scheduler states 2. timeline view

quaint-diamond-37493

04/04/2023, 10:18 AM

Seems like you don't have a node with enough CPU resources in your sandbox cluster...?

hallowed-mouse-14616

04/04/2023, 1:20 PM

@miniature-plumber-7394 we have done quite a bit of work on task observability that will land in the next few weeks. Two things to highlight here: (1) 1.5 will have an updated status message in the task execution pane, so the "0/1 nodes are available: 1 Insufficient cpu ..." information that you're seeing in k8s will be visible in the UI. In addition, we're planning on overlaying a time series of these messages on the timeline view in the UI, so users can see the changing state of tasks during execution. (2) We have implemented the notion of "runtime metrics" from this RFC, which will surface to the user as a finer-grained breakdown of the timeline view in the UI. Think things like workflow / node / task setup and teardown times, plugin-level overhead, etc. This will provide much better insight into what is actually happening within a workflow execution.

🦜 1

👍 1

hallowed-mouse-14616

04/04/2023, 1:21 PM

To solve this problem, it's exactly what @quaint-diamond-37493 mentioned: it looks like there is insufficient CPU to schedule the Pod.

miniature-plumber-7394

04/04/2023, 4:32 PM

Thanks @quaint-diamond-37493 @hallowed-mouse-14616. I’m gonna try spinning stuff down and seeing if I can get it to work. I think a key detail is that I do have a custom docker container.

quaint-diamond-37493

04/04/2023, 4:44 PM

That does not matter for the scheduler, only requested and available CPUs matter.

👍 1
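
For anyone landing here later, a rough sketch of the check behind "Insufficient cpu": a pod fits on a node only if its CPU request is no larger than the node's allocatable CPU minus the requests of the pods already placed there. The container image plays no part. This is a simplified illustration, not the actual kube-scheduler code; `parse_cpu`, `Node`, and `fits` are hypothetical names, and the node numbers below are made up (only the pod's request of 2 CPUs comes from the output above).

```python
from dataclasses import dataclass
from typing import List


def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "2", "0.5" or "500m" to millicores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


@dataclass
class Node:
    allocatable_cpu: str      # assumed value, e.g. "4" on a 4-vCPU host
    requested_cpu: List[str]  # CPU requests of pods already on the node


def fits(node: Node, pod_cpu_request: str) -> bool:
    """Return True if the pod's CPU request fits into the node's unrequested CPU."""
    used = sum(parse_cpu(q) for q in node.requested_cpu)
    free = parse_cpu(node.allocatable_cpu) - used
    return parse_cpu(pod_cpu_request) <= free


# Hypothetical sandbox: 4 allocatable CPUs, 2.5 CPUs already requested.
busy = Node(allocatable_cpu="4", requested_cpu=["1", "500m", "750m", "250m"])
print(fits(busy, "2"))  # the task above requests 2 CPUs
print(fits(busy, "1"))  # a smaller request

# Same node after spinning down the pod that requested 750m.
trimmed = Node(allocatable_cpu="4", requested_cpu=["1", "500m", "250m"])
print(fits(trimmed, "2"))
```

Under these assumed numbers there are two ways out: free up requested CPU on the node (spin other pods down), or lower the task's CPU request so it fits in what is left.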
