Hi, I’m using flyte sandbox and trying to run some...
# ask-the-community
g
Hi, I’m using the Flyte sandbox and trying to run some tasks in Flyte. I have a task that needs to access the GPU on my host machine. Since the GPU is on my host machine, it needs to be passed through to the k8s pod that’s running inside the Docker container (sandbox). However, I see that the pod cannot be scheduled.
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  11m    default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
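The event above packs two separate blockers into one line: the node advertises no nvidia.com/gpu resource, and the pod's node affinity/selector does not match. As a rough sketch (the helper name scheduling_reasons is hypothetical, and the parsing assumes the message layout shown in this event, which is not a stable format), the reasons can be split out like this:

```python
import re


def scheduling_reasons(message: str) -> dict:
    """Split a FailedScheduling message into {reason: node_count}.

    Assumes the layout seen in this thread:
    "0/1 nodes are available: <n> <reason>, <n> <reason>. preemption: ..."
    """
    # Ignore the trailing preemption summary.
    head = message.split(" preemption:", 1)[0]
    # Drop the "0/1 nodes are available:" prefix.
    _, _, body = head.partition("nodes are available:")
    reasons = {}
    for part in body.strip().rstrip(".").split(", "):
        match = re.match(r"(\d+) (.+)", part.strip())
        if match:
            reasons[match.group(2)] = int(match.group(1))
    return reasons


event_message = (
    "0/1 nodes are available: 1 Insufficient nvidia.com/gpu, "
    "1 node(s) didn't match Pod's node affinity/selector. "
    "preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling."
)

for reason, count in scheduling_reasons(event_message).items():
    print(f"{count} node(s): {reason}")
```

Both reasons have to be resolved before the pod can land on the single sandbox node.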
I also checked kubectl describe node 31e2e2ba6f9f (31e2e2ba6f9f is the flyte-sandbox container in which Flyte runs in the sandbox environment). It looks like the node (the Docker container in this case) itself doesn’t have a GPU to allocate to the pod:
>kubectl describe node 31e2e2ba6f9f
...
Capacity:
  cpu:                8
  ephemeral-storage:  944801904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  919103291491
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
...
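The same check can be scripted instead of read off the describe output. This is a minimal sketch, assuming kubectl is on the PATH and pointed at the sandbox cluster; gpu_count and fetch_node are hypothetical helper names, and the sample node below only mirrors the Allocatable section shown above.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_count(node: dict, resource: str = GPU_RESOURCE) -> int:
    """Return how many GPUs a node object advertises as allocatable (0 if none)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(resource, "0"))


def fetch_node(name: str) -> dict:
    """Fetch a node object as JSON via kubectl (needs a working kubeconfig)."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# A node shaped like the output above: no nvidia.com/gpu entry at all.
sandbox_node = {
    "status": {"allocatable": {"cpu": "8", "memory": "65774996Ki", "pods": "110"}}
}
print(gpu_count(sandbox_node))

# Against a live cluster (not run here):
#   print(gpu_count(fetch_node("31e2e2ba6f9f")))
```

A count of zero means the scheduler has nothing to satisfy a nvidia.com/gpu request with, whatever the pod spec says.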
I think this is happening because the flyte-sandbox Docker container is not able to access the GPUs. I logged into the container and nvidia-smi is not available there; it needs to work inside the container for the container to access the GPUs.
❯ docker exec -it 31e2e2ba6f9f /bin/sh
/ # nvidia-smi
/bin/sh: nvidia-smi: not found
/ #
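The manual docker exec check above can be wrapped in a small helper. This is only a sketch: container_sees_gpu is a hypothetical name, it assumes the docker CLI is installed on the host, and it treats any non-zero exit status (including the "not found" case shown above) as "no GPU access".

```python
import subprocess


def container_sees_gpu(container: str, run=subprocess.run) -> bool:
    """Return True if `nvidia-smi` runs successfully inside the container.

    `run` is injectable so the function can be exercised without Docker.
    """
    try:
        result = run(
            ["docker", "exec", container, "nvidia-smi"],
            capture_output=True, text=True,
        )
    except FileNotFoundError:
        # The docker CLI itself is not installed on this machine.
        return False
    return result.returncode == 0


# Against the sandbox container from this thread (not run here):
#   container_sees_gpu("31e2e2ba6f9f")
```

Note that a passing check only shows the container can reach the driver; the node still has to advertise nvidia.com/gpu before pods can be scheduled on it.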
Note that outside of Flyte, I’m able to run the task image in a Docker container on my host machine as shown below; it needs --gpus all to be passed for the container to access the GPUs:
docker run --gpus all <task_image_with_gpu_req>
I think if we can somehow pass --gpus all to the docker run command for the sandbox during flytectl demo start, it should work. Please help!
Is there a recommended way to bring up Flyte on a local machine so that pods can access GPUs? I’m not sure the sandbox supports that, because the --gpus all argument is missing from the docker run command for the flyte-sandbox container.
I think it has nothing to do with --gpus all. It looks like we need to set up k8s compatibility with GPUs using https://github.com/NVIDIA/k8s-device-plugin. It’s not working for me at the moment, which looks to be the issue.
e
cc: @jeev
j
this isn’t currently supported OOB unfortunately, but there are a few threads where folks have gotten this to work
@Gaurav Kumar: This might be informative: https://github.com/flyteorg/flyte/pull/3256
g
@jeev I did some digging and found that https://github.com/NVIDIA/k8s-device-plugin only works with the none or kvm2 drivers, as listed here. Please help me with the questions below if you know the answers.
• Since the sandbox uses the docker driver, that’s why it’s not working for me. I’d appreciate it if you could point me to some threads from people who got this working with the sandbox.
• Another option is to install Flyte some other way on my local machine to give it a try. Please recommend any other installation method (helm install …) that doesn’t require the docker driver for k8s.
• Also, for this change https://github.com/flyteorg/flyte/pull/3256: once it is merged, does this mean a k8s pod should be able to access the GPU if we start with flytectl demo start? Is my understanding correct?
j
@Gaurav Kumar: i believe the PR comments say to build the image manually and use it as the sandbox image
i haven’t tried it myself unfortunately
g
Let me try that. Thanks
Hi @jeev, I’m trying to build the image like you said. I’m facing the same issue with the master repo, without the https://github.com/flyteorg/flyte/pull/3256 patch, as well. Here’s what I did with the master flyte repo:
1. Clone the repo
2. cd docker/sandbox-bundled/
3. make build
4. Initially it gives the error below:
Error: lookup_func.go:106: [ERROR] unable to retrieve resource list for: v1 , error: Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused
lookup_func.go:80: [ERROR] unable to get apiresource from unstructured: /v1, Kind=Secret , error Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused
Error: template: flyte-sandbox/charts/postgresql/templates/secrets.yaml:17:24: executing "flyte-sandbox/charts/postgresql/templates/secrets.yaml" at <include "common.secrets.passwords.manage" (dict "secret" (include "common.names.fullname" .) "key" "postgres-password" "providedValues" (list "global.postgresql.auth.postgresPassword" "auth.postgresPassword") "context" $)>: error calling include: template: flyte-sandbox/charts/minio/charts/common/templates/_secrets.tpl:93:20: executing "common.secrets.passwords.manage" at <lookup "v1" "Secret" (include "common.names.namespace" .context) .secret>: error calling lookup: unable to get apiresource from unstructured: /v1, Kind=Secret: Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused
5. It was resolved after I started a minikube cluster using minikube start
6. Then, if I run make build again, this error comes up:
Error: Error: template: flyte-sandbox/charts/flyte-binary/templates/deployment.yaml:37:35: executing "flyte-sandbox/charts/flyte-binary/templates/deployment.yaml" at <include (print $.Template.BasePath "/configmap.yaml") .>: error calling include: template: flyte-sandbox/charts/flyte-binary/templates/configmap.yaml:196:8: executing "flyte-sandbox/charts/flyte-binary/templates/configmap.yaml" at <tpl (.Values.configuration.inline | toYaml) .>: error calling tpl: error during tpl function execution for "plugins:\n  k8s:\n    default-env-vars:\n    - FLYTE_AWS_ENDPOINT: http://{{ printf \"%!s(MISSING)-minio\" .Release.Name | trunc 63 | trimSuffix\n        \"-\" }}.{{ .Release.Namespace }}:9000\n    - FLYTE_AWS_ACCESS_KEY_ID: minio\n    - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage\nstorage:\n  signedURL:\n    stowConfigOverride:\n      endpoint: http://localhost:30002": parse error at (flyte-sandbox/charts/flyte-binary/templates/deployment.yaml:4): unclosed action
: unable to run: 'helm template flyte-sandbox /home/gakumar/projects/flyte/flyte/charts/flyte-sandbox --namespace flyte -f /tmp/kustomize-helm-23569202/flyte-sandbox-kustomize-values.yaml' with env=[HELM_CONFIG_HOME=/tmp/kustomize-helm-23569202/helm HELM_CACHE_HOME=/tmp/kustomize-helm-23569202/helm/.cache HELM_DATA_HOME=/tmp/kustomize-helm-23569202/helm/.data] (is 'helm' installed?): exit status 1
make: *** [Makefile:22: manifests] Error 1
I’m facing the same error with and without https://github.com/flyteorg/flyte/pull/3256. Could you please let me know what I am missing here?
@jeev I was able to make progress on building the image by removing manifests-gpu from the build-gpu section. k8s was able to access the GPU and assign the pod to the local node. Thank you for your help. However, I ran into another issue with the image build, mentioned here. But I think we are good to close this thread! Thank you.