# ask-the-community
r
So folks, I have a job that's been running for the last 16 minutes on the sandbox (it typically takes around 2 min). I see that it says Parallelism 0 on the web UI. Am I missing something here?
So it seems like it wasn't scheduled at all:
Name:             ang6jngn42zjdh69tqrp-n0-0
Namespace:        flytesnacks-development
Priority:         0
Service Account:  default
Node:             <none>
Labels:           domain=development
                  execution-id=ang6jngn42zjdh69tqrp
                  interruptible=false
                  node-id=n0
                  project=flytesnacks
                  shard-key=12
                  task-name=workflows-primary-retsynth-sample
                  workflow-name=workflows-primary-wf
Annotations:      cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    flyteworkflow/ang6jngn42zjdh69tqrp
Containers:
  ang6jngn42zjdh69tqrp-n0-0:
    Image:      localhost:30000/retsynth:be00641f3da51673c039189feb16bf269d2708a7
    Port:       <none>
    Host Port:  <none>
    Args:
      pyflyte-execute
      --inputs
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ang6jngn42zjdh69tqrp/n0/data/inputs.pb
      --output-prefix
      s3://my-s3-bucket/metadata/propeller/flytesnacks-development-ang6jngn42zjdh69tqrp/n0/data/0
      --raw-output-data-prefix
      s3://my-s3-bucket/data/bu/ang6jngn42zjdh69tqrp-n0-0
      --checkpoint-path
      s3://my-s3-bucket/data/bu/ang6jngn42zjdh69tqrp-n0-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      workflows.primary
      task-name
      retsynth_sample
    Limits:
      cpu:     2
      memory:  200Mi
    Requests:
      cpu:     2
      memory:  200Mi
    Environment:
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:workflows.primary.wf
      FLYTE_INTERNAL_EXECUTION_ID:        ang6jngn42zjdh69tqrp
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           workflows.primary.retsynth_sample
      FLYTE_INTERNAL_TASK_VERSION:        be00641f3da51673c039189feb16bf269d2708a71
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                workflows.primary.retsynth_sample
      FLYTE_INTERNAL_VERSION:             be00641f3da51673c039189feb16bf269d2708a71
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
      FLYTE_AWS_ENDPOINT:                 http://flyte-sandbox-minio.flyte:9000
      FLYTE_INTERNAL_EXECUTION_WORKFLOW:  flytesnacks:development:workflows.primary.wf
      FLYTE_INTERNAL_EXECUTION_ID:        ang6jngn42zjdh69tqrp
      FLYTE_INTERNAL_EXECUTION_PROJECT:   flytesnacks
      FLYTE_INTERNAL_EXECUTION_DOMAIN:    development
      FLYTE_ATTEMPT_NUMBER:               0
      FLYTE_INTERNAL_TASK_PROJECT:        flytesnacks
      FLYTE_INTERNAL_TASK_DOMAIN:         development
      FLYTE_INTERNAL_TASK_NAME:           workflows.primary.retsynth_sample
      FLYTE_INTERNAL_TASK_VERSION:        be00641f3da51673c039189feb16bf269d2708a71
      FLYTE_INTERNAL_PROJECT:             flytesnacks
      FLYTE_INTERNAL_DOMAIN:              development
      FLYTE_INTERNAL_NAME:                workflows.primary.retsynth_sample
      FLYTE_INTERNAL_VERSION:             be00641f3da51673c039189feb16bf269d2708a71
      FLYTE_AWS_ENDPOINT:                 http://flyte-sandbox-minio.flyte:9000
      FLYTE_AWS_ACCESS_KEY_ID:            minio
      FLYTE_AWS_SECRET_ACCESS_KEY:        miniostorage
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z7crm (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-z7crm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  28m   default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  22m   default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
root@ubuntu-s-4vcpu-8gb-intel
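The telling part of the output above is the `FailedScheduling` event. As a rough sketch, here is one way to pull the node counts and reasons out of such a message. The function name `parse_failed_scheduling` is made up for illustration, and the message layout is inferred only from the events pasted above; it is not a stable Kubernetes API, so treat the parsing as an assumption.

```python
import re


def parse_failed_scheduling(message: str) -> dict:
    """Split a FailedScheduling message into node counts and per-reason counts.

    Hypothetical helper: the format is inferred from the events in this thread.
    """
    # Only the first clause explains the failure; drop the "preemption: ..." tail.
    head = message.split(" preemption:")[0].strip().rstrip(".")
    match = re.match(r"(\d+)/(\d+) nodes are available: (.+)", head)
    if match is None:
        raise ValueError(f"unrecognized message: {message!r}")
    available, total, reasons_text = match.groups()
    reasons = {}
    for part in reasons_text.split(", "):
        # Each part looks like "<count> <reason>", e.g. "1 Insufficient cpu".
        count, reason = part.split(" ", 1)
        reasons[reason] = int(count)
    return {"available": int(available), "total": int(total), "reasons": reasons}


# Message copied from the Events section above.
msg = (
    "0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are "
    "available: 1 No preemption victims found for incoming pod."
)
info = parse_failed_scheduling(msg)
print(info)
```

Here the single node in the sandbox is rejected for one reason only, insufficient CPU, which points at the `Requests: cpu: 2` block rather than at the image or the arguments.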
k
cc @Dan Rammer (hamersaw). @Radhakrishna Sanka, new PRs are merged and will be released soon that give you a lot more visibility: (1) scheduler states, (2) timeline view.
f
Seems like you don't have a node with enough CPU resources in your sandbox cluster...?
d
@Radhakrishna Sanka we have done quite a bit of work on task observability that will land in the next few weeks. Two things to highlight here: (1) 1.5 will have an updated status message in the task execution pane, so the "0/1 nodes are available: 1 Insufficient cpu ..." information that you're seeing in k8s will be visible in the UI. In addition, we're planning on overlaying a time-series of these messages on the timeline view in the UI, so users can see the changing state of tasks during execution. (2) We have implemented the notion of "runtime metrics" from this RFC, which will show up for the user as a more fine-grained breakdown of the timeline view in the UI. Think things like workflow / node / task setup and teardown times, plugin-level overhead, etc. This will provide much better insight into what is actually happening within a workflow execution.
As for solving this problem, it is exactly what @Felix Ruess mentioned: it looks like there is insufficient CPU to schedule the Pod.
r
Thanks @Felix Ruess @Dan Rammer (hamersaw). I’m gonna try spinning stuff down and seeing if I can get it to work. I think a key detail is that I do have a custom docker container.
f
That does not matter for the scheduler, only requested and available CPUs matter.
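To make that concrete, here is a minimal sketch of the CPU fit check the scheduler effectively performs: a pod fits on a node only if its CPU request is no larger than the node's allocatable CPU minus the requests of pods already placed there. The helper names are made up, the 4-CPU node is an assumption based on the `4vcpu` in the hostname above, and the list of other pods' requests is purely hypothetical.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "2" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits(pod_request: str, node_allocatable: str, already_requested: list) -> bool:
    """Return True if the pod's CPU request fits in the node's unrequested CPU.

    Only requests count here, not actual usage and not the container image.
    """
    used = sum(cpu_to_millicores(q) for q in already_requested)
    free = cpu_to_millicores(node_allocatable) - used
    return cpu_to_millicores(pod_request) <= free


# Assumption: one node with 4 allocatable CPUs.
# Hypothetical: other pods on the node already request 2350m in total.
other_pods = ["2", "250m", "100m"]

print(fits("2", "4", other_pods))     # the task pod above requests cpu: 2
print(fits("500m", "4", other_pods))  # a smaller request for comparison
```

With these made-up numbers only 1650m is left unrequested, so a 2-CPU request cannot be placed while a 500m request can. The usual ways out are freeing up requests on the node (spinning other pods down) or lowering the task's own CPU request, which in flytekit is typically set through the `requests` argument of the `@task` decorator.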