Gaurav Kumar
08/07/2023, 2:36 PM
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 11m default-scheduler 0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
I also checked kubectl describe node 31e2e2ba6f9f (31e2e2ba6f9f is the flyte-sandbox container in which Flyte is running in the sandbox environment).
Looks like the node (a docker container in this case) itself doesn’t have a GPU to allocate to the pod.
>kubectl describe node 31e2e2ba6f9f
...
Capacity:
  cpu: 8
  ephemeral-storage: 944801904Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 65774996Ki
  pods: 110
Allocatable:
  cpu: 8
  ephemeral-storage: 919103291491
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 65774996Ki
  pods: 110
...
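For anyone who wants to script this check instead of reading the describe output by eye, here is a minimal Python sketch. It assumes the node object comes from kubectl get node <name> -o json and that GPUs show up under status.allocatable as nvidia.com/gpu (the resource name from the FailedScheduling event). The helper names allocatable_gpus and fetch_node are made up for illustration.

```python
import json
import subprocess

# Resource name taken from the FailedScheduling event in this thread.
GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(node: dict) -> int:
    """Return the GPU count a node advertises, or 0 if the resource is absent."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(GPU_RESOURCE, "0"))


def fetch_node(name: str) -> dict:
    """Fetch a node object as JSON via kubectl (needs a reachable cluster)."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hand-written node shaped like the describe output above: no GPU entry at all.
sandbox_node = {
    "status": {
        "allocatable": {"cpu": "8", "memory": "65774996Ki", "pods": "110"}
    }
}
print(allocatable_gpus(sandbox_node))
```

Against a live cluster this would be allocatable_gpus(fetch_node("31e2e2ba6f9f")). A result of 0 matches the "Insufficient nvidia.com/gpu" scheduling message.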
I think this is happening because the flyte-sandbox docker container is not able to access the GPUs. I logged into the container and nvidia-smi is not found there, and it needs to work for a container to access GPUs.
❯ docker exec -it 31e2e2ba6f9f /bin/sh
/ # nvidia-smi
/bin/sh: nvidia-smi: not found
/ #
Note that outside of Flyte, I’m able to run the task image in a docker container on my host machine as below. It needs --gpus all to be passed for the container to access GPUs:
docker run --gpus all <task_image_with_gpu_req>
I think if we can somehow pass --gpus all to the docker run command for the sandbox during the flytectl demo start command, it should work.
Please help!
--gpus all argument missing to the docker run cmd for flyte-sandbox container.
--gpus all, looks like we need to set up k8s compatibility with gpus using https://github.com/NVIDIA/k8s-device-plugin
It’s not working for me atm, which looks to be the issue.
Eduardo Apolinario (eapolinario)
08/07/2023, 7:24 PM
jeev
Gaurav Kumar
08/08/2023, 3:57 AM
none or kvm2 drivers as listed here. Please help me answer the questions below if you know the answers.
• Since the sandbox uses the docker driver, that’s why it’s not working for me. I’d appreciate it if you could point me to any threads from people who got this working with the sandbox.
• Another option is to install it some other way on my local machine to give it a try. Please recommend any other installation method (helm install …) that doesn’t require the docker driver for k8s.
• Also, once https://github.com/flyteorg/flyte/pull/3256 is merged, does this mean a k8s pod should be able to access the GPU if we start with flytectl demo start? Is my understanding correct?
jeev
Gaurav Kumar
08/08/2023, 4:21 AM
cd docker/sandbox-bundled/
3. make build
4. Initially it gives the error below
Error: lookup_func.go:106: [ERROR] unable to retrieve resource list for: v1 , error: Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused
lookup_func.go:80: [ERROR] unable to get apiresource from unstructured: /v1, Kind=Secret , error Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused
Error: template: flyte-sandbox/charts/postgresql/templates/secrets.yaml:17:24: executing "flyte-sandbox/charts/postgresql/templates/secrets.yaml" at <include "common.secrets.passwords.manage" (dict "secret" (include "common.names.fullname" .) "key" "postgres-password" "providedValues" (list "global.postgresql.auth.postgresPassword" "auth.postgresPassword") "context" $)>: error calling include: template: flyte-sandbox/charts/minio/charts/common/templates/_secrets.tpl:93:20: executing "common.secrets.passwords.manage" at <lookup "v1" "Secret" (include "common.names.namespace" .context) .secret>: error calling lookup: unable to get apiresource from unstructured: /v1, Kind=Secret: Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused
5. It was resolved after I started a minikube cluster using minikube start
6. Then, if I run make build again, this error comes up
Error: Error: template: flyte-sandbox/charts/flyte-binary/templates/deployment.yaml:37:35: executing "flyte-sandbox/charts/flyte-binary/templates/deployment.yaml" at <include (print $.Template.BasePath "/configmap.yaml") .>: error calling include: template: flyte-sandbox/charts/flyte-binary/templates/configmap.yaml:196:8: executing "flyte-sandbox/charts/flyte-binary/templates/configmap.yaml" at <tpl (.Values.configuration.inline | toYaml) .>: error calling tpl: error during tpl function execution for "plugins:\n k8s:\n default-env-vars:\n - FLYTE_AWS_ENDPOINT: http://{{ printf \"%!s(MISSING)-minio\" .Release.Name | trunc 63 | trimSuffix\n \"-\" }}.{{ .Release.Namespace }}:9000\n - FLYTE_AWS_ACCESS_KEY_ID: minio\n - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage\nstorage:\n signedURL:\n stowConfigOverride:\n endpoint: http://localhost:30002": parse error at (flyte-sandbox/charts/flyte-binary/templates/deployment.yaml:4): unclosed action
: unable to run: 'helm template flyte-sandbox /home/gakumar/projects/flyte/flyte/charts/flyte-sandbox --namespace flyte -f /tmp/kustomize-helm-23569202/flyte-sandbox-kustomize-values.yaml' with env=[HELM_CONFIG_HOME=/tmp/kustomize-helm-23569202/helm HELM_CACHE_HOME=/tmp/kustomize-helm-23569202/helm/.cache HELM_DATA_HOME=/tmp/kustomize-helm-23569202/helm/.data] (is 'helm' installed?): exit status 1
make: *** [Makefile:22: manifests] Error 1
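A note on the connection-refused errors from step 4: they show helm's lookup trying to reach an API server at 127.0.0.1:8080, which fits the fact that they went away once a cluster was started with minikube start. As a rough pre-check before make build, a sketch like the one below tests whether anything is listening on that address. The function name api_server_reachable is hypothetical, and the host and port are copied from the error text (an assumption about the fallback address, not a Flyte or helm setting).

```python
import socket


def api_server_reachable(host: str = "127.0.0.1", port: int = 8080,
                         timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host and port come from the "dial tcp 127.0.0.1:8080" error above.
    print(api_server_reachable())
```

This only checks that the port accepts TCP connections. It does not verify that a working Kubernetes API server is behind it, so running kubectl cluster-info is still the more meaningful test.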
I’m facing the same error with and without https://github.com/flyteorg/flyte/pull/3256. Could you please let me know what I am missing here?
manifests-gpu from build-gpu section. k8s was able to access the GPU and assign the pod to the local node. Thank you for your help.
However, I faced another issue due to the image build mentioned here. But I think we are good to close this thread! Thank you.