# ask-the-community
n
Hi community, is there a way to use gpus in the sandbox cluster?
k
Here is a pr to build a gpu image for the sandbox.
n
tried building the image but got the following error. not sure if I missed something
=> ERROR [flytebuilder 6/7] COPY --from=flyteconsole /app/dist cmd/single/dist    0.0s
------
 > [flytebuilder 6/7] COPY --from=flyteconsole /app/dist cmd/single/dist:
------
Dockerfile:15
--------------------
  13 |     RUN go mod download
  14 |     COPY cmd cmd
  15 | >>> COPY --from=flyteconsole /app/dist cmd/single/dist
  16 |     RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/root/go/pkg/mod \
  17 |         go build -tags console -v -o dist/flyte cmd/main.go
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref o9gpa8wxdy93rhm0ln129nnas::jkz102t1mu3qkv65xr1hyjjzi: "/app/dist": not found
besides sandbox, are there other on prem deployment options?
n
is there also a limit on memory request in sandbox? I am getting rejections when requesting more than 1Gi of mem
k
yeah, there is a limit in the flyte-sandbox config map.
task_resource_defaults.yaml: |
    task_resources:
      defaults:
        cpu: 500m
        memory: 1Gi
        storage: 500Mi
      limits:
        cpu: 2
        gpu: 5
        memory: 4Gi
        storage: 20Mi
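For illustration only: a minimal Python sketch of how a request compares with the `limits` block above. It assumes a request is rejected when it exceeds the corresponding limit, and it handles only the suffixes `m`, `Ki`, `Mi`, `Gi` and `Ti`. `parse_quantity` and `within_limit` are hypothetical helper names, not part of Flyte or Kubernetes.

```python
from decimal import Decimal

# Binary suffixes plus the milli suffix used for CPU; other Kubernetes
# suffixes (k, M, G, ...) are deliberately left out of this sketch.
SUFFIXES = {
    "Ki": Decimal(2**10),
    "Mi": Decimal(2**20),
    "Gi": Decimal(2**30),
    "Ti": Decimal(2**40),
    "m": Decimal("0.001"),
}


def parse_quantity(quantity: str) -> Decimal:
    """Convert a quantity such as '500m', '2' or '4Gi' to a plain number."""
    for suffix, factor in SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * factor
    return Decimal(quantity)


def within_limit(request: str, limit: str) -> bool:
    """True when the requested amount does not exceed the limit."""
    return parse_quantity(request) <= parse_quantity(limit)


# Limits copied from the config map shown in the thread.
limits = {"cpu": "2", "memory": "4Gi"}
print(within_limit("1Gi", limits["memory"]))  # request equal to the default
print(within_limit("8Gi", limits["memory"]))  # request above the limit
```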
n
is there a way to override that when creating the sandbox cluster?
s
You can increase the mem values by running the `kubectl -n flyte edit cm flyte-admin-base-config` command.
b
Updated the GPU PR; hopefully it works with a current demo cluster now
n
thanks! I will give it a try
getting 503 from ghcr. Something wrong on github? 🤔
ERROR: failed to solve: ghcr.io/flyteorg/flyteconsole:latest: failed to authorize: failed to fetch anonymous token: unexpected status: 503 Service Unavailable
s
Oh yeah. GHCR is down. https://www.githubstatus.com/
n
@Björn I was able to build the gpu image with the updated PR, but the sandbox container still immediately exited with code 1 (same with docker run). What version of flytectl did you test with? I am on
{
  "App": "flytectl",
  "Build": "29da288",
  "Version": "0.6.34",
  "BuildTime": "2023-03-15 10:45:13.597115631 -0500 CDT m=+0.041086554"
}
b
Maybe it's better to continue the discussion here, rather than in the PR ^_^
I'm using the same flytectl...
You can start the container with bash as entrypoint by running `docker run -it --entrypoint bash flyte-sandbox-gpu:latest`, then try to start the cluster with `/bin/k3d-entrypoint.sh` and check the output in the logs in `/var/log/k3d-entrypoints_$(date "+%y%m%d%H%M%S").log`. However, since 1.4 there is the new bootstrapping functionality, and the k3d-entrypoint script doesn't start the cluster on its own anymore.
`flytectl demo start --image flyte-sandbox-gpu:latest` should work however... You could also commit the failed container (the one that exits with code 1) to a new image, start a container based on that with bash as entrypoint, and check the logs from there...
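A small sketch, in Python for illustration, of how the `date "+%y%m%d%H%M%S"` part of that log path expands. `entrypoint_log_path` is a hypothetical helper. The timestamp is whenever the entrypoint started, so in practice listing `/var/log` is usually simpler than reconstructing the name.

```python
from datetime import datetime


def entrypoint_log_path(started_at: datetime) -> str:
    """Build the k3d entrypoint log path for a given start time."""
    # %y%m%d%H%M%S is two-digit year, month, day, hour, minute, second.
    stamp = started_at.strftime("%y%m%d%H%M%S")
    return f"/var/log/k3d-entrypoints_{stamp}.log"


# Start time taken from the entrypoint log posted later in the thread.
print(entrypoint_log_path(datetime(2023, 3, 15, 19, 2, 51)))
```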
n
here are the logs:
[2023-03-15T17:53:37+00:00] Running k3d entrypoints...
[2023-03-15T17:53:37+00:00] Running /bin/k3d-entrypoint-cgroupv2.sh
[2023-03-15T17:53:37+00:00] Running /bin/k3d-entrypoint-flyte-sandbox-bootstrap.sh
2023/03/15 17:53:37 failed to apply transformations: lookup host.docker.internal on 8.8.8.8:53: no such host
b
yeah, I get that same error when running the entrypoint with v1.4, so it's something else... check with the stopped container after a failed `flytectl demo start`, or maybe @jeev can help?
Aha, inspecting the container from the cluster it seems you need to pass an extra host to the container... try
docker run -it --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest
j
flytectl does that for you
b
then run the entrypoint with parameters like the Dockerfile: `/bin/k3d-entrypoint.sh server --disable=traefik --disable=servicelb`
@jeev true 🙂 we're trying to debug why `flytectl demo start` can't start the gpu demo container
j
ah
but instead of --detach use -it so you see logs
n
here is what I did:
docker run -it --gpus all --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest
in the container
/bin/k3d-entrypoint.sh server --disable=traefik --disable=servicelb
part of the logs:
INFO[0010] certificate CN=system:node:293d72719bcd,O=system:nodes signed by CN=k3s-client-ca@1678906975: notBefore=2023-03-15 19:02:55 +0000 UTC notAfter=2024-03-14 19:03:05 +0000 UTC 
INFO[0010] Waiting to retrieve agent configuration; server is not ready: "overlayfs" snapshotter cannot be enabled for "/var/lib/rancher/k3s/agent/containerd", try using "fuse-overlayfs" or "native": failed to mount overlay: operation not permitted 
INFO[0011] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found 
INFO[0012] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found 
INFO[0013] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found 
INFO[0014] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found 
I0315 19:03:10.389468      14 range_allocator.go:83] Sending events to api server.
The logs keep rolling like that forever.
`/var/log/k3d-entrypoints_230315190251.log` looks fine:
[2023-03-15T19:02:51+00:00] Running k3d entrypoints...
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-cgroupv2.sh
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-flyte-sandbox-bootstrap.sh
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-gpu-check.sh
GPU Enabled - checking if it's available
Wed Mar 15 19:02:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:17:00.0 Off |                  N/A |
| 30%   44C    P0    91W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvidia-smi working
[2023-03-15T19:02:52+00:00] Finished k3d entrypoint scripts!
b
Try starting the container in privileged mode, I think that might be needed:
docker run -it --gpus all --privileged --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest
n
I updated makefile:
.PHONY: start
start: FLYTE_SANDBOX_IMAGE := flyte-sandbox-gpu:latest
start: FLYTE_DEV := False
start:
	[ -n "$(shell docker volume ls --filter name=^flyte-sandbox$$ --format {{.Name}})" ] || \
		docker volume create flyte-sandbox
	[ -n "$(shell docker ps --filter name=^flyte-sandbox$$ --format {{.Names}})" ] || \
		docker run -it --rm --privileged --name flyte-sandbox \
			--gpus all \
			--add-host "host.docker.internal:host-gateway" \
			--env FLYTE_DEV=$(FLYTE_DEV) \
			--env K3S_KUBECONFIG_OUTPUT=/.kube/kubeconfig \
			--volume $(PWD)/.kube:/.kube \
			--volume $(HOME)/.flyte/sandbox:/var/lib/flyte/config \
			--volume flyte-sandbox:/var/lib/flyte/storage \
			--publish "6443":"6443" \
			--publish "30000:30000" \
			--publish "30001:30001" \
			--publish "30002:30002" \
			--publish "30080:30080" \
			$(FLYTE_SANDBOX_IMAGE)
		export KUBECONFIG=.kube/kubeconfig
and it is working with `make start`!
b
Surprising that `flytectl demo start` didn't work then... 🤔
n
I committed the failed container. Here are the log files in `/var/log`:
alternatives.log  apt  bootstrap.log  btmp  dpkg.log  faillog  lastlog  wtmp
do you want to take a look at any of these files?
@jeev how do I set the kubectl context after `make start`? `make kubeconfig` doesn't do the magic the way `flytectl demo start` does.
j
source <(make kubeconfig)
b
@Nan Qin weird there aren't any k3d-entrypoint logs... not sure what to look for in the others :-/
@jeev Any theories why a container would start with the Makefile command, but not with `flytectl demo start --image X`?
Also, for me `flytectl demo start --image flyte-sandbox-gpu:latest` works...
j
if you use the dry-run flag with flytectl it will print out commands
so maybe can repro that way?
b
Aha, thanks - Please give it a go @Nan Qin 🙂
n
❇️ Run the following command to create new sandbox container
        docker create --privileged -p 0.0.0.0:30000:30000 -p 0.0.0.0:30001:30001 -p 0.0.0.0:30002:30002 -p 0.0.0.0:6443:6443 -p 0.0.0.0:30080:30080 --env SANDBOX=1 --env KUBERNETES_API_PORT=30086 --env FLYTE_HOST=localhost:30081 --env FLYTE_AWS_ENDPOINT=http://localhost:30084 --env K3S_KUBECONFIG_OUTPUT=/var/lib/flyte/config/kubeconfig --mount type=bind,source=/home/nan/.flyte,target=/etc/rancher/ --mount type=bind,source=/home/nan/.flyte/sandbox,target=/var/lib/flyte/config --mount type=volume,source=flyte-sandbox,target=/var/lib/flyte/storage --name flyte-sandbox flyte-sandbox-gpu:latest
It doesn't have `--add-host "host.docker.internal:host-gateway"`
j
hmm that might just be a bug in rendering the line. is everything else the same?
it does work for @Björn so it’s likely an issue with your local setup @Nan Qin
not sure what though
n
"is everything else the same?" Do you mean the same as the make target?
j
right. you can try running the command directly with the add-host arg
n
yeah I added `--add-host host.docker.internal:host-gateway --gpus all` and it works
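The by-eye comparison between the working command and the dry-run output can be sketched in Python. This is an illustrative helper, not something flytectl or Docker provides: `option_names` and `missing_options` are hypothetical names, and the sketch only compares option names (tokens starting with `-`), not their values.

```python
import shlex


def option_names(command: str) -> set:
    """Collect option names (tokens starting with '-') from a command line."""
    names = set()
    for token in shlex.split(command):
        if token.startswith("-"):
            # Treat --flag=value and --flag value the same way.
            names.add(token.split("=", 1)[0])
    return names


def missing_options(working: str, failing: str) -> list:
    """Options present in the working command but absent from the failing one."""
    return sorted(option_names(working) - option_names(failing))


# Shortened versions of the two commands from the thread.
working = (
    'docker run --privileged --gpus all '
    '--add-host "host.docker.internal:host-gateway" '
    "--name flyte-sandbox flyte-sandbox-gpu:latest"
)
failing = "docker create --privileged --name flyte-sandbox flyte-sandbox-gpu:latest"
print(missing_options(working, failing))
```

Differences in spelling (`-p` versus `--publish`, `--mount` versus `--volume`) will also show up, so the result is a starting point for inspection rather than a verdict.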
j
hmm
@Yee: cc
b
Just to verify if it's really missing @Nan Qin, you can `docker inspect` the failed container from flytectl and check if it has a section such as this one:
"ExtraHosts": [
"host.docker.internal:host-gateway"
]
I have that one even though --add-host is missing from the dryRun output
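As a hedged sketch, the same check can be done on saved `docker inspect` output in Python. It assumes the output is the usual JSON array with one object per container and that the entry lives under `HostConfig.ExtraHosts`; `extra_hosts` is a hypothetical helper name.

```python
import json
from typing import List


def extra_hosts(inspect_json_text: str) -> List[str]:
    """Return the ExtraHosts entries of the first container in the output."""
    containers = json.loads(inspect_json_text)
    host_config = containers[0].get("HostConfig", {})
    # ExtraHosts can be null when no --add-host was given.
    return host_config.get("ExtraHosts") or []


# Trimmed-down stand-in for real `docker inspect flyte-sandbox` output.
sample = '[{"HostConfig": {"ExtraHosts": ["host.docker.internal:host-gateway"]}}]'
print("host.docker.internal:host-gateway" in extra_hosts(sample))
```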
n
yeah the failed one has it
"ExtraHosts": [
            "host.docker.internal:host-gateway"
        ],
b
Focusing on the `--gpus all`... I realise now that I have a non-standard setting in my Docker daemon config, needed for gpus to be passed to docker build.
/etc/docker/daemon.json:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
Without `"default-runtime": "nvidia",` it doesn't start with `flytectl demo start` for me either.
Check your default runtime with `docker info|grep -i runtime`
I have an error message that should notify the user, but I guess it is hidden by the flytectl tool... Would've been better to put it in the logs. 😕
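A minimal Python sketch of the same check done against the contents of `/etc/docker/daemon.json` rather than `docker info`. It assumes the file is plain JSON as shown above and that Docker falls back to `runc` when `default-runtime` is not set; `default_runtime` is a hypothetical helper name.

```python
import json


def default_runtime(daemon_json_text: str) -> str:
    """Return the configured default runtime, or 'runc' when none is set."""
    config = json.loads(daemon_json_text)
    return config.get("default-runtime", "runc")


# Same content as the daemon.json posted in the thread.
daemon_json = """
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
"""
print(default_runtime(daemon_json))
```

Reading the real file would be `default_runtime(open("/etc/docker/daemon.json").read())`. The Docker daemon has to be restarted before a change to that file takes effect.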
j
i was curious about --gpus all as well
nice find @Björn
b
Wouldn't have found it without your help @jeev - it was a group effort 🙂 Hope it works for @Nan Qin
n
yes, after setting the default runtime to nvidia, flytectl demo start works for me.
big thanks to @Björn and @jeev
@Samhita Alla I was trying to change the cpu/memory/storage limit as you suggested, but there is no cm called `flyte-admin-base-config`. Below are all the cms in all namespaces:
kube-system               extension-apiserver-authentication               6      3h52m
kube-system               cluster-dns                                      2      3h52m
flyte                     flyte-sandbox-cluster-resource-templates         1      3h52m
flyte                     flyte-sandbox-config                             5      3h52m
flyte                     flyte-sandbox-docker-registry-config             1      3h52m
flyte                     flyte-sandbox-extra-cluster-resource-templates   0      3h52m
flyte                     flyte-sandbox-extra-config                       0      3h52m
flyte                     flyte-sandbox-proxy-config                       1      3h52m
flyte                     kubernetes-dashboard-settings                    0      3h52m
kube-system               chart-content-nvidia-device-plugin               0      3h52m
kube-system               chart-values-nvidia-device-plugin                0      3h52m
kube-system               local-path-config                                4      3h52m
flyte                     kube-root-ca.crt                                 1      3h52m
kube-system               kube-root-ca.crt                                 1      3h52m
default                   kube-root-ca.crt                                 1      3h52m
kube-public               kube-root-ca.crt                                 1      3h52m
kube-node-lease           kube-root-ca.crt                                 1      3h52m
kube-system               coredns                                          2      3h52m
flytesnacks-development   kube-root-ca.crt                                 1      3h51m
flytesnacks-staging       kube-root-ca.crt                                 1      3h51m
flytesnacks-production    kube-root-ca.crt                                 1      3h51m
I also searched through all the CMs and didn't find anything related to cpu/memory/storage limits. Any idea to change those limits?
@Yee: cc
@Nan Qin : add:
task_resources:
  defaults:
    cpu: 1
    memory: 2Gi
  limits:
    cpu: 4
    memory: 8Gi
or equivalent to `~/.flyte/sandbox/config.yaml` and run `flytectl demo reload`, then wait a bit for the pod to reconcile and restart on its own
n
@jeev is there a way to increase /dev/shm in the pod? Pytorch dataloaders easily run out of shared memory as in this issue. Below is output of df -h from the pod
overlay         1.8T  932G  808G  54% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/nvme0n1p2  1.8T  932G  808G  54% /etc/hosts
shm              64M  8.0K   64M   1% /dev/shm
tmpfs            32G   12K   32G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            32G   12K   32G   1% /proc/driver/nvidia
tmpfs            32G     0   32G   0% /proc/acpi
tmpfs            32G     0   32G   0% /proc/scsi
tmpfs            32G     0   32G   0% /sys/firmware
j
yea using a pod template seems like the best bet. there is no easier way to add volumes to task pods outside of using sidecar tasks. but not sure how this works within docker though.
n
hmm, it didn't pick up the podTemplate. Here is what I did:
1. `kubectl -n flyte edit cm flyte-sandbox-config`, add template-name so it is now
010-inline-config.yaml: |
    plugins:
      k8s:
        default-env-vars:
        - FLYTE_AWS_ENDPOINT: http://flyte-sandbox-minio.flyte:9000
        - FLYTE_AWS_ACCESS_KEY_ID: minio
        - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
        default-pod-template-name: flyte-template
2. `kubectl apply -f podTemplate.yaml`, which is
apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-template
  namespace: flyte
template:
  spec:
    volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 64000Mi
    containers:
      - name: default
        image: docker.io/rwgrim/docker-noop
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        env:
          - name: FOO
            value: BAR
3. wait for a few mins and start a workflow. Neither the volume nor the env vars are in the task container.
Did I miss something?
j
you don’t have to modify the configmap directly
n
also tried setting pod_template_name in @task as in this pr, but got error:
Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist
however I have the podTemplate in both the `flyte` and `flytesnacks-development` namespaces
j
just add it to ~/.flyte/sandbox/config.yaml
did you already create that file @Nan Qin
n
`~/.flyte/sandbox/config.yaml`?
j
yea
n
should it look like this?
task_resources:
  defaults:
    cpu: 1
    memory: 2Gi
    storage: 32Gi
  limits:
    cpu: 8
    memory: 128Gi
    storage: 512Gi
inline:
  plugins:
    k8s:
      default-pod-template-name: flyte-template
j
yes
drop the “inline:” and indent “plugins:” left
n
ok, reloaded
still getting this
Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist
I have the podTemplates as
NAMESPACE                 NAME             CONTAINERS   IMAGES                         POD LABELS
flyte                     flyte-template   default      docker.io/rwgrim/docker-noop   <none>
flytesnacks-development   flyte-template   default      docker.io/rwgrim/docker-noop   <none>
j
did the flyte-sandbox deployment restart?
did you “flytectl demo reload”?
n
yes, reload
j
ok. did the pod restart and pick up the changes?
n
still getting this
Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist
let me try again
it works now
you are my hero! @jeev
b
very helpful thread - helped me get going on testing using a GPU on the demo sandbox. thanks @Nan Qin for raising this, and @Björn and @jeev for the guidance. looking forward to when we get all this as part of the official image and docs