Hi, I’m using flyte sandbox and trying to run some tasks in flyte | Flyte #flyte-support

incalculable-ice-13425

08/07/2023, 2:36 PM

Hi, I’m using flyte sandbox and trying to run some tasks in flyte. I have a task that needs to access the gpu on my host machine. Since the gpu is in my host machine, it needs to be passed on to the k8s pod that’s running inside the docker container (sandbox). However, I see that Kubernetes is not able to schedule the pod.

Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  11m    default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu, 1 node(s) didn't match Pod's node affinity/selector. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

I also checked kubectl describe node 31e2e2ba6f9f (31e2e2ba6f9f is the flyte-sandbox container in which flyte is running in the sandbox environment). It looks like the node (the docker container in this case) itself doesn’t have a gpu to allocate to the pod.

>kubectl describe node 31e2e2ba6f9f
...
Capacity:
  cpu:                8
  ephemeral-storage:  944801904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  919103291491
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
...
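
The missing GPU can also be checked programmatically: a node that exposes GPUs lists an nvidia.com/gpu entry under its allocatable resources, and the output above has none. Below is a minimal Python sketch, assuming the node object has been fetched as JSON (for example with kubectl get node 31e2e2ba6f9f -o json). The helper name gpu_allocatable is hypothetical, and the embedded node object is a trimmed-down stand-in for the real one.

```python
import json


def gpu_allocatable(node: dict, resource: str = "nvidia.com/gpu") -> int:
    """Return how many GPUs the node advertises as allocatable (0 if none)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Extended resources such as nvidia.com/gpu are reported as integer strings;
    # a missing key is treated here as zero GPUs (an assumption of this sketch).
    return int(allocatable.get(resource, "0"))


# Trimmed-down stand-in for the sandbox node shown above: no nvidia.com/gpu entry.
sandbox_node = json.loads("""
{
  "status": {
    "allocatable": {
      "cpu": "8",
      "memory": "65774996Ki",
      "pods": "110"
    }
  }
}
""")

print(gpu_allocatable(sandbox_node))  # no GPU entry, so this prints 0
```

If the count is 0, any pod requesting nvidia.com/gpu stays unschedulable on that node, which matches the "Insufficient nvidia.com/gpu" part of the FailedScheduling event.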

I think this is happening because the flyte-sandbox docker container is not able to access gpus. I logged into the container and nvidia-smi is not found there; it needs to work for the container to access gpus.

❯ docker exec -it 31e2e2ba6f9f /bin/sh
/ # nvidia-smi
/bin/sh: nvidia-smi: not found
/ #

Note that outside of flyte, I’m able to run the task image in a docker container on my host machine as below, which needs --gpus all to be passed for the container to access gpus

docker run --gpus all <task_image_with_gpu_req>

I think if we can somehow pass --gpus all to the docker run command for the sandbox during the flytectl demo start command, it should work. Please help!

incalculable-ice-13425

08/07/2023, 2:43 PM

Is there any recommended way to bring up flyte on a local machine so that pods can access gpus? I’m not sure if the sandbox supports that, because the --gpus all argument is missing from the docker run cmd for the flyte-sandbox container.

incalculable-ice-13425

08/07/2023, 6:28 PM

I think it has nothing to do with --gpus all. It looks like we need to set up k8s compatibility with gpus using https://github.com/NVIDIA/k8s-device-plugin. It’s not working for me atm, which looks to be the issue.

high-accountant-32689

08/07/2023, 7:24 PM

cc: @freezing-boots-56761

freezing-boots-56761

08/07/2023, 7:26 PM

this isn’t currently supported OOB unfortunately, but there are a few threads where folks have gotten this to work

freezing-boots-56761

08/07/2023, 7:31 PM

@incalculable-ice-13425: This might be informative: https://github.com/flyteorg/flyte/pull/3256

incalculable-ice-13425

08/08/2023, 3:57 AM

@freezing-boots-56761 I did some digging and found that https://github.com/NVIDIA/k8s-device-plugin only works with the none and kvm2 drivers, as listed here. Please help me with the questions below if you know the answers.
• Since the docker driver is used for the sandbox, that’s why it’s not working for me. I’d appreciate it if you could point me to some threads where people got this working with the sandbox.
• Another option is to install it some other way on my local machine to give it a try. Please recommend any other installation method (helm install …) that doesn’t require the docker driver for k8s.
• Also, once https://github.com/flyteorg/flyte/pull/3256 is merged, does this mean a k8s pod should be able to access the gpu if we start with flytectl demo start? Is my understanding correct?

freezing-boots-56761

08/08/2023, 4:04 AM

@incalculable-ice-13425: i believe the PR comments say to build the image manually and use it as the sandbox image

freezing-boots-56761

08/08/2023, 4:04 AM

i haven’t tried it myself unfortunately

incalculable-ice-13425

08/08/2023, 4:21 AM

Let me try that. Thanks

incalculable-ice-13425

08/08/2023, 7:52 AM

Hi @freezing-boots-56761 I’m trying to build the image like you said. I’m facing the same issue with the master repo, without the https://github.com/flyteorg/flyte/pull/3256 patch as well. Here’s what I did in the case of the master flyte repo.
1. Clone the repo
2. cd docker/sandbox-bundled/
3. make build
4. Initially it gives the below error

Error: lookup_func.go:106: [ERROR] unable to retrieve resource list for: v1 , error: Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused
lookup_func.go:80: [ERROR] unable to get apiresource from unstructured: /v1, Kind=Secret , error Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused
Error: template: flyte-sandbox/charts/postgresql/templates/secrets.yaml:17:24: executing "flyte-sandbox/charts/postgresql/templates/secrets.yaml" at <include "common.secrets.passwords.manage" (dict "secret" (include "common.names.fullname" .) "key" "postgres-password" "providedValues" (list "global.postgresql.auth.postgresPassword" "auth.postgresPassword") "context" $)>: error calling include: template: flyte-sandbox/charts/minio/charts/common/templates/_secrets.tpl:93:20: executing "common.secrets.passwords.manage" at <lookup "v1" "Secret" (include "common.names.namespace" .context) .secret>: error calling lookup: unable to get apiresource from unstructured: /v1, Kind=Secret: Get http://localhost:8080/api/v1?timeout=32s: dial tcp 127.0.0.1:8080: connect: connection refused

5. It was resolved after I started a minikube cluster using minikube start
6. Then, if I run make build again, this error comes

Error: Error: template: flyte-sandbox/charts/flyte-binary/templates/deployment.yaml:37:35: executing "flyte-sandbox/charts/flyte-binary/templates/deployment.yaml" at <include (print $.Template.BasePath "/configmap.yaml") .>: error calling include: template: flyte-sandbox/charts/flyte-binary/templates/configmap.yaml:196:8: executing "flyte-sandbox/charts/flyte-binary/templates/configmap.yaml" at <tpl (.Values.configuration.inline | toYaml) .>: error calling tpl: error during tpl function execution for "plugins:\n  k8s:\n    default-env-vars:\n    - FLYTE_AWS_ENDPOINT: http://{{ printf \"%!s(MISSING)-minio\" .Release.Name | trunc 63 | trimSuffix\n        \"-\" }}.{{ .Release.Namespace }}:9000\n    - FLYTE_AWS_ACCESS_KEY_ID: minio\n    - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage\nstorage:\n  signedURL:\n    stowConfigOverride:\n      endpoint: http://localhost:30002": parse error at (flyte-sandbox/charts/flyte-binary/templates/deployment.yaml:4): unclosed action
: unable to run: 'helm template flyte-sandbox /home/gakumar/projects/flyte/flyte/charts/flyte-sandbox --namespace flyte -f /tmp/kustomize-helm-23569202/flyte-sandbox-kustomize-values.yaml' with env=[HELM_CONFIG_HOME=/tmp/kustomize-helm-23569202/helm HELM_CACHE_HOME=/tmp/kustomize-helm-23569202/helm/.cache HELM_DATA_HOME=/tmp/kustomize-helm-23569202/helm/.data] (is 'helm' installed?): exit status 1
make: *** [Makefile:22: manifests] Error 1

I’m facing the same error with and without https://github.com/flyteorg/flyte/pull/3256. Could you please let me know what I am missing here?

incalculable-ice-13425

08/08/2023, 1:17 PM

@freezing-boots-56761 I was able to make progress on building the image by removing manifests-gpu from the build-gpu section. k8s was able to access the gpu and assign the pod to the local node. Thanks for your help. However, I faced another issue due to the image build, mentioned here. But I think we are good to close this thread! Thank you.

🙌🏽 1
