Hi community is there a way to use gpus in the sandbox clust Flyte #flyte-support

# flyte-support

shy-accountant-549

03/14/2023, 8:42 PM

Hi community, is there a way to use gpus in the sandbox cluster?

glamorous-carpet-83516

03/14/2023, 8:44 PM

Here is a pr to build a gpu image for the sandbox.

shy-accountant-549

03/14/2023, 9:55 PM

tried building the image but got the following error. not sure if I missed something

=> ERROR [flytebuilder 6/7] COPY --from=flyteconsole /app/dist cmd/single/dist    0.0s
------
 > [flytebuilder 6/7] COPY --from=flyteconsole /app/dist cmd/single/dist:
------
Dockerfile:15
--------------------
  13 |     RUN go mod download
  14 |     COPY cmd cmd
  15 | >>> COPY --from=flyteconsole /app/dist cmd/single/dist
  16 |     RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/root/go/pkg/mod \
  17 |         go build -tags console -v -o dist/flyte cmd/main.go
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref o9gpa8wxdy93rhm0ln129nnas::jkz102t1mu3qkv65xr1hyjjzi: "/app/dist": not found

shy-accountant-549

03/14/2023, 10:17 PM

besides sandbox, are there other on prem deployment options?

glamorous-carpet-83516

03/14/2023, 10:19 PM

need to update line 15. https://github.com/flyteorg/flyte/commit/401973c72dd59497f03ca6efd2267fe850ccb5c3

👍 1

shy-accountant-549

03/14/2023, 10:24 PM

is there also a limit on memory request in sandbox? I am getting rejections when requesting more than 1Gi of mem

glamorous-carpet-83516

03/14/2023, 10:30 PM

yeah, there is a limit in the flyte-sandbox config map.

task_resource_defaults.yaml: |
    task_resources:
      defaults:
        cpu: 500m
        memory: 1Gi
        storage: 500Mi
      limits:
        cpu: 2
        gpu: 5
        memory: 4Gi
        storage: 20Mi
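
A request above the limits block is what gets rejected. The sketch below is not Flyte's validation code (the thread does not show it); it only illustrates comparing Kubernetes-style quantities against the limits pasted above. `parse_quantity` and `exceeds_limits` are hypothetical helper names, and only the common suffixes are handled.

```python
from decimal import Decimal

# Multipliers for common Kubernetes quantity suffixes (not exhaustive).
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"), "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6, "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
}

# Limits copied from the config map shown in the thread.
LIMITS = {"cpu": "2", "gpu": "5", "memory": "4Gi", "storage": "20Mi"}


def parse_quantity(value) -> Decimal:
    """Convert a quantity such as '500m', '2' or '4Gi' to a plain number."""
    text = str(value).strip()
    # Try the longer suffixes first so 'Mi' is matched before any single letter.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def exceeds_limits(requests: dict, limits: dict = LIMITS) -> list:
    """Return the names of requested resources that are above their limit."""
    return [
        name
        for name, amount in requests.items()
        if name in limits and parse_quantity(amount) > parse_quantity(limits[name])
    ]


print(exceeds_limits({"cpu": "500m", "memory": "8Gi"}))
```

With the limits above, a task asking for 8Gi of memory is flagged while the 500m CPU request is not.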

shy-accountant-549

03/15/2023, 3:25 AM

is there a way to override that when creating the sandbox cluster?

tall-lock-23197

03/15/2023, 7:39 AM

You can increase the mem values by running

kubectl -n flyte edit cm flyte-admin-base-config

command.

quick-salesclerk-18019

03/15/2023, 12:07 PM

Updated the GPU PR; hopefully it works with a current demo cluster now

shy-accountant-549

03/15/2023, 2:41 PM

thanks! I will give it a try

shy-accountant-549

03/15/2023, 3:03 PM

getting 503 from ghcr. Something wrong on github? 🤔

ERROR: failed to solve: ghcr.io/flyteorg/flyteconsole:latest: failed to authorize: failed to fetch anonymous token: unexpected status: 503 Service Unavailable

tall-lock-23197

03/15/2023, 3:03 PM

Oh yeah. GHCR is down. https://www.githubstatus.com/

👀 1

shy-accountant-549

03/15/2023, 3:46 PM

@quick-salesclerk-18019 I was able to build the GPU image with the updated PR, but the sandbox container still exited immediately with code 1 (same with docker run). What version of flytectl did you test with? I am on

{
  "App": "flytectl",
  "Build": "29da288",
  "Version": "0.6.34",
  "BuildTime": "2023-03-15 10:45:13.597115631 -0500 CDT m=+0.041086554"
}

quick-salesclerk-18019

03/15/2023, 5:44 PM

Maybe it's better to continue the discussion here, rather than in the PR ^_^

👍 1

quick-salesclerk-18019

03/15/2023, 5:45 PM

I'm using the same flytectl...

quick-salesclerk-18019

03/15/2023, 5:48 PM

You can start the container with bash as entrypoint by running

docker run -it --entrypoint bash flyte-sandbox-gpu:latest

and then try to start the cluster with

/bin/k3d-entrypoint.sh

and check the output in the logs in

/var/log/k3d-entrypoints_$(date "+%y%m%d%H%M%S").log

. However, since 1.4 there is the new bootstrapping functionality, and the k3d-entrypoint script doesn't start the cluster on its own anymore.
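
The timestamp in that log name is the moment the entrypoint ran, so evaluating date later yields a different name. As a small sketch (helper names are hypothetical), the same pattern can be built from a known start time, and the newest log can be picked from a directory listing because the fixed-width timestamps sort lexicographically:

```python
from datetime import datetime


def k3d_entrypoint_log_path(started_at: datetime) -> str:
    """Build the log path with the same pattern as date "+%y%m%d%H%M%S"."""
    return f"/var/log/k3d-entrypoints_{started_at.strftime('%y%m%d%H%M%S')}.log"


def newest_log(paths):
    """Pick the most recent log name, or None if the list is empty."""
    return max(paths) if paths else None


print(k3d_entrypoint_log_path(datetime(2023, 3, 15, 19, 2, 51)))
# prints /var/log/k3d-entrypoints_230315190251.log
```

That output matches the log file name reported further down in this thread for a run started at 19:02:51 UTC.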

quick-salesclerk-18019

03/15/2023, 5:50 PM

flytectl demo start --image flyte-sandbox-gpu:latest

should work however... You could also commit the failed container (the one that exits with code 1) to a new image, start a container based on that with bash as entrypoint, and check the logs from there...

shy-accountant-549

03/15/2023, 5:55 PM

here are the logs:

[2023-03-15T17:53:37+00:00] Running k3d entrypoints...
[2023-03-15T17:53:37+00:00] Running /bin/k3d-entrypoint-cgroupv2.sh
[2023-03-15T17:53:37+00:00] Running /bin/k3d-entrypoint-flyte-sandbox-bootstrap.sh
2023/03/15 17:53:37 failed to apply transformations: lookup host.docker.internal on 8.8.8.8:53: no such host

quick-salesclerk-18019

03/15/2023, 6:50 PM

yeah, I get that same error when running the entrypoint with v1.4, so it's something else... check with the stopped container after a failed

flytectl demo start

or maybe @freezing-boots-56761 can help?

quick-salesclerk-18019

03/15/2023, 6:55 PM

Aha, inspecting the container from the cluster it seems you need to pass an extra host to the container... try

docker run -it --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest

freezing-boots-56761

03/15/2023, 6:56 PM

flytectl does that for you

quick-salesclerk-18019

03/15/2023, 6:56 PM

then run the entrypoint with parameters like the Dockerfile:

/bin/k3d-entrypoint.sh server --disable=traefik --disable=servicelb

quick-salesclerk-18019

03/15/2023, 6:58 PM

@freezing-boots-56761 true 🙂 we're trying to debug why flytectl demo start can't start the gpu demo container

freezing-boots-56761

03/15/2023, 6:58 PM

freezing-boots-56761

03/15/2023, 6:59 PM

try with this make target: https://github.com/flyteorg/flyte/blob/b82eaa5c507640a551f942de5c789198d076a491/docker/sandbox-bundled/Makefile#L42

freezing-boots-56761

03/15/2023, 7:00 PM

but instead of --detach use -it so you see logs

👍 1

shy-accountant-549

03/15/2023, 7:05 PM

here is what I did:

docker run -it --gpus all --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest

in the container

/bin/k3d-entrypoint.sh server --disable=traefik --disable=servicelb

part of the logs:

INFO[0010] certificate CN=system:node:293d72719bcd,O=system:nodes signed by CN=k3s-client-ca@1678906975: notBefore=2023-03-15 19:02:55 +0000 UTC notAfter=2024-03-14 19:03:05 +0000 UTC 
INFO[0010] Waiting to retrieve agent configuration; server is not ready: "overlayfs" snapshotter cannot be enabled for "/var/lib/rancher/k3s/agent/containerd", try using "fuse-overlayfs" or "native": failed to mount overlay: operation not permitted 
INFO[0011] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found 
INFO[0012] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found 
INFO[0013] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found 
INFO[0014] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found 
I0315 19:03:10.389468      14 range_allocator.go:83] Sending events to api server.

shy-accountant-549

03/15/2023, 7:06 PM

logs rolling like that forever

shy-accountant-549

03/15/2023, 7:10 PM

/var/log/k3d-entrypoints_230315190251.log

looks fine:

[2023-03-15T19:02:51+00:00] Running k3d entrypoints...
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-cgroupv2.sh
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-flyte-sandbox-bootstrap.sh
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-gpu-check.sh
GPU Enabled - checking if it's available
Wed Mar 15 19:02:52 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:17:00.0 Off |                  N/A |
| 30%   44C    P0    91W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvidia-smi working
[2023-03-15T19:02:52+00:00] Finished k3d entrypoint scripts!

👍 1

quick-salesclerk-18019

03/15/2023, 7:40 PM

Try starting the container in privileged mode, I think that might be needed:

docker run -it --gpus all --privileged --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest

shy-accountant-549

03/15/2023, 7:44 PM

I updated makefile:

.PHONY: start
start: FLYTE_SANDBOX_IMAGE := flyte-sandbox-gpu:latest
start: FLYTE_DEV := False
start:
	[ -n "$(shell docker volume ls --filter name=^flyte-sandbox$$ --format {{.Name}})" ] || \
		docker volume create flyte-sandbox
	[ -n "$(shell docker ps --filter name=^flyte-sandbox$$ --format {{.Names}})" ] || \
		docker run -it --rm --privileged --name flyte-sandbox \
			--gpus all \
			--add-host "host.docker.internal:host-gateway" \
			--env FLYTE_DEV=$(FLYTE_DEV) \
			--env K3S_KUBECONFIG_OUTPUT=/.kube/kubeconfig \
			--volume $(PWD)/.kube:/.kube \
			--volume $(HOME)/.flyte/sandbox:/var/lib/flyte/config \
			--volume flyte-sandbox:/var/lib/flyte/storage \
			--publish "6443":"6443" \
			--publish "30000:30000" \
			--publish "30001:30001" \
			--publish "30002:30002" \
			--publish "30080:30080" \
			$(FLYTE_SANDBOX_IMAGE)
		export KUBECONFIG=.kube/kubeconfig

and it is working with

make start

🎉 1

👍 2

quick-salesclerk-18019

03/15/2023, 8:12 PM

Surprising that

flytectl demo start

didn't work then... 🤔

shy-accountant-549

03/15/2023, 8:26 PM

I committed the failed container. Here are log files in

/var/log

alternatives.log  apt  bootstrap.log  btmp  dpkg.log  faillog  lastlog  wtmp

do you want to take a look at any of these files?

shy-accountant-549

03/15/2023, 8:28 PM

@freezing-boots-56761 how to set the kubectl context after

make start

make kubeconfig

doesn't do the magic as

flytectl demo start

freezing-boots-56761

03/15/2023, 8:28 PM

source <(make kubeconfig)

quick-salesclerk-18019

03/15/2023, 8:28 PM

@shy-accountant-549 weird there aren't any k3d-entrypoint logs... not sure what to look for in the others :-/

quick-salesclerk-18019

03/15/2023, 8:30 PM

@freezing-boots-56761 Any theories why a container would start with the Makefile command, but not

flytectl demo start --image X

quick-salesclerk-18019

03/15/2023, 8:30 PM

Also, for me

flytectl demo start --image flyte-sandbox-gpu:latest

works...

freezing-boots-56761

03/15/2023, 8:30 PM

if you use the dry-run flag with flytectl it will print out commands

freezing-boots-56761

03/15/2023, 8:30 PM

so maybe can repro that way?

quick-salesclerk-18019

03/15/2023, 8:31 PM

Aha, thanks - Please give it a go @shy-accountant-549 🙂

shy-accountant-549

03/15/2023, 8:42 PM

❇️ Run the following command to create new sandbox container
        docker create --privileged -p 0.0.0.0:30000:30000 -p 0.0.0.0:30001:30001 -p 0.0.0.0:30002:30002 -p 0.0.0.0:6443:6443 -p 0.0.0.0:30080:30080 --env SANDBOX=1 --env KUBERNETES_API_PORT=30086 --env FLYTE_HOST=localhost:30081 --env FLYTE_AWS_ENDPOINT=http://localhost:30084 --env K3S_KUBECONFIG_OUTPUT=/var/lib/flyte/config/kubeconfig --mount type=bind,source=/home/nan/.flyte,target=/etc/rancher/ --mount type=bind,source=/home/nan/.flyte/sandbox,target=/var/lib/flyte/config --mount type=volume,source=flyte-sandbox,target=/var/lib/flyte/storage --name flyte-sandbox flyte-sandbox-gpu:latest

doesn't have

--add-host "host.docker.internal:host-gateway"

freezing-boots-56761

03/15/2023, 8:43 PM

hmm that might just be a bug in rendering the line. is everything else the same?

freezing-boots-56761

03/15/2023, 8:44 PM

it does work for @quick-salesclerk-18019 so it’s likely an issue with your local setup @shy-accountant-549

freezing-boots-56761

03/15/2023, 8:44 PM

not sure what though

🤔 1

shy-accountant-549

03/15/2023, 8:50 PM

hmm that might just be a bug in rendering the line. is everything else the same?

the same as the make target?

freezing-boots-56761

03/15/2023, 8:51 PM

right. you can try running the command directly with the add-host arg

shy-accountant-549

03/15/2023, 8:59 PM

yeah I added

--add-host host.docker.internal:host-gateway --gpus all

and it works

freezing-boots-56761

03/15/2023, 8:59 PM

hmm

freezing-boots-56761

03/15/2023, 9:00 PM

@thankful-minister-83577: cc

quick-salesclerk-18019

03/15/2023, 9:25 PM

Just to verify if it's really missing @shy-accountant-549, you can

docker inspect

the failed container from flytectl and check if it has a section such as this one

"ExtraHosts": [

"host.docker.internal:host-gateway"

]

quick-salesclerk-18019

03/15/2023, 9:26 PM

I have that one even though --add-host is missing from the dryRun output

shy-accountant-549

03/15/2023, 9:53 PM

yeah the failed one has it

"ExtraHosts": [
            "host.docker.internal:host-gateway"
        ],
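
The same check can be scripted rather than read by eye. This is a minimal sketch, assuming the JSON array that docker inspect prints is passed in as text; `has_extra_host` is a hypothetical helper name, and the sample string is a trimmed stand-in for real output.

```python
import json


def has_extra_host(inspect_output: str,
                   entry: str = "host.docker.internal:host-gateway") -> bool:
    """Return True if any inspected container lists `entry` in HostConfig.ExtraHosts."""
    containers = json.loads(inspect_output)  # docker inspect prints a JSON array
    for container in containers:
        # ExtraHosts can be null when no --add-host was given.
        extra_hosts = container.get("HostConfig", {}).get("ExtraHosts") or []
        if entry in extra_hosts:
            return True
    return False


# Trimmed stand-in for real `docker inspect` output.
sample = '[{"HostConfig": {"ExtraHosts": ["host.docker.internal:host-gateway"]}}]'
print(has_extra_host(sample))  # prints True
```

In practice the text would come from running docker inspect against the stopped flyte-sandbox container.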

👍 1

quick-salesclerk-18019

03/16/2023, 6:30 AM

Focusing on the

--gpus all

... I realise now that I have a non-standard setting in my Docker daemon config that is needed for GPUs to be passed to docker build.

quick-salesclerk-18019

03/16/2023, 6:32 AM

/etc/docker/daemon.json:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

quick-salesclerk-18019

03/16/2023, 6:32 AM

Without

"default-runtime": "nvidia",

it doesn't start with the

flytectl demo start

for me either.

🙌 1

quick-salesclerk-18019

03/16/2023, 6:33 AM

Check your default runtime with

docker info|grep -i runtime
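
For a scripted version of that check, the sketch below reads the daemon configuration text instead of calling docker info. It assumes the file layout shown above; `default_runtime` and `nvidia_is_default` are hypothetical helper names. When the key is absent Docker falls back to runc.

```python
import json


def default_runtime(daemon_json_text: str) -> str:
    """Return the configured default runtime, or 'runc' when the key is absent."""
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    return config.get("default-runtime", "runc")


def nvidia_is_default(daemon_json_text: str) -> bool:
    """True when the daemon config makes the nvidia runtime the default."""
    return default_runtime(daemon_json_text) == "nvidia"


# Same content as the /etc/docker/daemon.json quoted in the thread.
sample = """
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
  }
}
"""
print(nvidia_is_default(sample))  # prints True
```

The running daemon only reflects the file after a restart, so docker info remains the authoritative check.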

quick-salesclerk-18019

03/16/2023, 6:38 AM

I have an error message that should notify the user, but I guess it is hidden by the flytectl tool... Would've been better to put it in the logs. 😕

freezing-boots-56761

03/16/2023, 6:54 AM

i was curious about --gpus all as well

freezing-boots-56761

03/16/2023, 6:55 AM

nice find @quick-salesclerk-18019

quick-salesclerk-18019

03/16/2023, 7:18 AM

Wouldn't have found it without your help @freezing-boots-56761 - it was a group effort 🙂 Hope it works for @shy-accountant-549

shy-accountant-549

03/16/2023, 2:44 PM

yes, after setting the default runtime to nvidia, flytectl demo start works for me.

shy-accountant-549

03/16/2023, 2:44 PM

big thanks to @quick-salesclerk-18019 and @freezing-boots-56761

👍 1

shy-accountant-549

03/17/2023, 11:42 PM

@tall-lock-23197 I was trying to change the cpu/memory/storage limit as you suggested. but there is no cm called

flyte-admin-base-config

. Below are all the cms in all namespaces:

kube-system               extension-apiserver-authentication               6      3h52m
kube-system               cluster-dns                                      2      3h52m
flyte                     flyte-sandbox-cluster-resource-templates         1      3h52m
flyte                     flyte-sandbox-config                             5      3h52m
flyte                     flyte-sandbox-docker-registry-config             1      3h52m
flyte                     flyte-sandbox-extra-cluster-resource-templates   0      3h52m
flyte                     flyte-sandbox-extra-config                       0      3h52m
flyte                     flyte-sandbox-proxy-config                       1      3h52m
flyte                     kubernetes-dashboard-settings                    0      3h52m
kube-system               chart-content-nvidia-device-plugin               0      3h52m
kube-system               chart-values-nvidia-device-plugin                0      3h52m
kube-system               local-path-config                                4      3h52m
flyte                     kube-root-ca.crt                                 1      3h52m
kube-system               kube-root-ca.crt                                 1      3h52m
default                   kube-root-ca.crt                                 1      3h52m
kube-public               kube-root-ca.crt                                 1      3h52m
kube-node-lease           kube-root-ca.crt                                 1      3h52m
kube-system               coredns                                          2      3h52m
flytesnacks-development   kube-root-ca.crt                                 1      3h51m
flytesnacks-staging       kube-root-ca.crt                                 1      3h51m
flytesnacks-production    kube-root-ca.crt                                 1      3h51m

shy-accountant-549

03/17/2023, 11:44 PM

I also searched through all the CMs and didn't find anything related to cpu/memory/storage limits. Any idea to change those limits?

freezing-boots-56761

03/17/2023, 11:51 PM

we need better docs for this: https://flyte-org.slack.com/archives/CP2HDHKE1/p1678454867658579?thread_ts=1677247518.480459&cid=CP2HDHKE1

freezing-boots-56761

03/17/2023, 11:51 PM

@thankful-minister-83577: cc

freezing-boots-56761

03/17/2023, 11:53 PM

@shy-accountant-549 : add:

task_resources:
  defaults:
    cpu: 1
    memory: 2Gi
  limits:
    cpu: 4
    memory: 8Gi

or equivalent to

~/.flyte/sandbox/config.yaml

and run

flytectl demo reload

freezing-boots-56761

03/17/2023, 11:53 PM

wait a bit for pod to reconcile and restart on its own

👍 1

shy-accountant-549

03/18/2023, 2:19 AM

@freezing-boots-56761 is there a way to increase /dev/shm in the pod? PyTorch dataloaders easily run out of shared memory, as in this issue. Below is the output of df -h from the pod

overlay         1.8T  932G  808G  54% /
tmpfs            64M     0   64M   0% /dev
tmpfs            32G     0   32G   0% /sys/fs/cgroup
/dev/nvme0n1p2  1.8T  932G  808G  54% /etc/hosts
shm              64M  8.0K   64M   1% /dev/shm
tmpfs            32G   12K   32G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            32G   12K   32G   1% /proc/driver/nvidia
tmpfs            32G     0   32G   0% /proc/acpi
tmpfs            32G     0   32G   0% /proc/scsi
tmpfs            32G     0   32G   0% /sys/firmware
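
The same number can be read from inside a task, which is handy for confirming whether a later change to /dev/shm took effect. A minimal sketch using only the standard library; the helper names and the idea of logging this at task start are assumptions, not something from the thread.

```python
import shutil


def shared_memory_mib(path: str = "/dev/shm") -> float:
    """Total size of the filesystem mounted at `path`, in MiB."""
    return shutil.disk_usage(path).total / (1024 * 1024)


def enough_shared_memory(required_mib: float, path: str = "/dev/shm") -> bool:
    """True when the mount at `path` is at least `required_mib` MiB in total."""
    return shared_memory_mib(path) >= required_mib
```

Calling enough_shared_memory with the amount the dataloader workers are expected to need, before starting them, turns a shared-memory crash into an early, readable failure.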

shy-accountant-549

03/18/2023, 2:37 AM

https://flyte-org.slack.com/archives/CP2HDHKE1/p1676661684505849?thread_ts=1676648061.967599&cid=CP2HDHKE1 and update the cm to use podtemplate?

freezing-boots-56761

03/18/2023, 3:12 AM

yea using a pod template seems like the best bet. there is no easier way to add volumes to task pods outside of using sidecar tasks. but not sure how this works within docker though.

shy-accountant-549

03/18/2023, 3:40 AM

hmm, it didn't pick up the podTemplate. Here is what I did:

1. kubectl -n flyte edit cm flyte-sandbox-config, adding default-pod-template-name so it is now

010-inline-config.yaml: |
    plugins:
      k8s:
        default-env-vars:
        - FLYTE_AWS_ENDPOINT: http://flyte-sandbox-minio.flyte:9000
        - FLYTE_AWS_ACCESS_KEY_ID: minio
        - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
        default-pod-template-name: flyte-template

2. kubectl apply -f podTemplate.yaml which is

apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-template
  namespace: flyte
template:
  spec:
    volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 64000Mi
    containers:
      - name: default
        image: docker.io/rwgrim/docker-noop
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        env:
          - name: FOO
            value: BAR

3. wait for a few mins and start a workflow. Neither the volume nor the env vars are in the task container.

shy-accountant-549

03/18/2023, 3:40 AM

Did I miss something?

freezing-boots-56761

03/18/2023, 3:59 AM

you don’t have to modify the configmap directly

shy-accountant-549

03/18/2023, 4:00 AM

also tried setting pod_template_name in @task as in this pr, but got error:

Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist

however I have the podTemplate in both

flyte

and

flytesnacks-development

freezing-boots-56761

03/18/2023, 4:00 AM

just add it to ~/.flyte/sandbox/config.yaml

freezing-boots-56761

03/18/2023, 4:01 AM

did you already create that file @shy-accountant-549

shy-accountant-549

03/18/2023, 4:01 AM

~/.flyte/sandbox/config.yaml

freezing-boots-56761

03/18/2023, 4:01 AM

yea

shy-accountant-549

03/18/2023, 4:02 AM

should it look like this?

task_resources:
  defaults:
    cpu: 1
    memory: 2Gi
    storage: 32Gi
  limits:
    cpu: 8
    memory: 128Gi
    storage: 512Gi
inline:
  plugins:
    k8s:
      default-pod-template-name: flyte-template

freezing-boots-56761

03/18/2023, 4:02 AM

yes

freezing-boots-56761

03/18/2023, 4:02 AM

drop the “inline:” and indent “plugins:” left
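
Put differently, plugins should end up as a top-level sibling of task_resources. The sketch below shows that reshaping on plain dicts standing in for the parsed YAML; `hoist_inline` is a hypothetical helper, not part of flytectl.

```python
def hoist_inline(config: dict) -> dict:
    """Return a copy of `config` with the children of 'inline' moved to the top level."""
    result = {key: value for key, value in config.items() if key != "inline"}
    for key, value in (config.get("inline") or {}).items():
        result[key] = value
    return result


# The config as first proposed in the thread, with the extra 'inline' level.
before = {
    "task_resources": {
        "defaults": {"cpu": 1, "memory": "2Gi", "storage": "32Gi"},
        "limits": {"cpu": 8, "memory": "128Gi", "storage": "512Gi"},
    },
    "inline": {"plugins": {"k8s": {"default-pod-template-name": "flyte-template"}}},
}

after = hoist_inline(before)
print(sorted(after))  # prints ['plugins', 'task_resources']
```

The resulting structure is what ~/.flyte/sandbox/config.yaml should contain before running flytectl demo reload.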

shy-accountant-549

03/18/2023, 4:03 AM

ok, reloaded

shy-accountant-549

03/18/2023, 4:04 AM

still getting this

Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist

shy-accountant-549

03/18/2023, 4:04 AM

I have the podTemplates as

NAMESPACE                 NAME             CONTAINERS   IMAGES                         POD LABELS
flyte                     flyte-template   default      docker.io/rwgrim/docker-noop   <none>
flytesnacks-development   flyte-template   default      docker.io/rwgrim/docker-noop   <none>

freezing-boots-56761

03/18/2023, 4:05 AM

did the flyte-sandbox deployment restart?

freezing-boots-56761

03/18/2023, 4:05 AM

did you “flytectl demo reload”?

shy-accountant-549

03/18/2023, 4:05 AM

yes, reload

freezing-boots-56761

03/18/2023, 4:05 AM

ok. did the pod restart and pick up the changes?

shy-accountant-549

03/18/2023, 4:07 AM

still getting this

Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist

shy-accountant-549

03/18/2023, 4:07 AM

let me try again

shy-accountant-549

03/18/2023, 4:08 AM

it works now

shy-accountant-549

03/18/2023, 4:10 AM

you are my hero! @freezing-boots-56761

❤️ 1

millions-table-8574

04/03/2023, 7:09 AM

very helpful thread - helped me get going on testing using a GPU on the demo sandbox. thanks @shy-accountant-549 for raising this, @quick-salesclerk-18019 @freezing-boots-56761 for the guidance. looking forward to when we get all this as part of the official image and docs

👍 3
