Nan Qin
03/14/2023, 8:42 PM
Kevin Su
03/14/2023, 8:44 PM
Nan Qin
03/14/2023, 9:55 PM
=> ERROR [flytebuilder 6/7] COPY --from=flyteconsole /app/dist cmd/single/dist 0.0s
------
> [flytebuilder 6/7] COPY --from=flyteconsole /app/dist cmd/single/dist:
------
Dockerfile:15
--------------------
13 | RUN go mod download
14 | COPY cmd cmd
15 | >>> COPY --from=flyteconsole /app/dist cmd/single/dist
16 | RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/root/go/pkg/mod \
17 | go build -tags console -v -o dist/flyte cmd/main.go
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref o9gpa8wxdy93rhm0ln129nnas::jkz102t1mu3qkv65xr1hyjjzi: "/app/dist": not found
Kevin Su
03/14/2023, 10:19 PM
Nan Qin
03/14/2023, 10:24 PM
Kevin Su
03/14/2023, 10:30 PM
task_resource_defaults.yaml: |
  task_resources:
    defaults:
      cpu: 500m
      memory: 1Gi
      storage: 500Mi
    limits:
      cpu: 2
      gpu: 5
      memory: 4Gi
      storage: 20Mi
Nan Qin
03/15/2023, 3:25 AM
Samhita Alla
03/15/2023, 7:39 AM
kubectl -n flyte edit cm flyte-admin-base-config
command.
Björn
03/15/2023, 12:07 PM
Nan Qin
03/15/2023, 2:41 PM
ERROR: failed to solve: ghcr.io/flyteorg/flyteconsole:latest: failed to authorize: failed to fetch anonymous token: unexpected status: 503 Service Unavailable
Samhita Alla
03/15/2023, 3:03 PM
Nan Qin
03/15/2023, 3:46 PM
{
  "App": "flytectl",
  "Build": "29da288",
  "Version": "0.6.34",
  "BuildTime": "2023-03-15 10:45:13.597115631 -0500 CDT m=+0.041086554"
}
Björn
03/15/2023, 5:44 PM
docker run -it --entrypoint bash flyte-sandbox-gpu:latest
and then try to start the cluster with /bin/k3d-entrypoint.sh
and checking the output in the logs in /var/log/k3d-entrypoints_$(date "+%y%m%d%H%M%S").log.
However, since 1.4 there is the new bootstrapping functionality, and the k3d-entrypoint script doesn't start the cluster on its own anymore.
flytectl demo start --image flyte-sandbox-gpu:latest
should work, however... You could also commit the failed container (the one that exits with code 1) to a new image, start a container based on that with bash as the entrypoint, and check the logs from there...
Nan Qin
03/15/2023, 5:55 PM
[2023-03-15T17:53:37+00:00] Running k3d entrypoints...
[2023-03-15T17:53:37+00:00] Running /bin/k3d-entrypoint-cgroupv2.sh
[2023-03-15T17:53:37+00:00] Running /bin/k3d-entrypoint-flyte-sandbox-bootstrap.sh
2023/03/15 17:53:37 failed to apply transformations: lookup host.docker.internal on 8.8.8.8:53: no such host
Björn
03/15/2023, 6:50 PM
flytectl demo start
or maybe @jeev can help?
docker run -it --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest
jeev
03/15/2023, 6:56 PM
Björn
03/15/2023, 6:56 PM
/bin/k3d-entrypoint.sh server --disable=traefik --disable=servicelb
jeev
03/15/2023, 6:58 PM
Nan Qin
03/15/2023, 7:05 PM
docker run -it --gpus all --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest
in the container
/bin/k3d-entrypoint.sh server --disable=traefik --disable=servicelb
part of the logs:
INFO[0010] certificate CN=system:node:293d72719bcd,O=system:nodes signed by CN=k3s-client-ca@1678906975: notBefore=2023-03-15 19:02:55 +0000 UTC notAfter=2024-03-14 19:03:05 +0000 UTC
INFO[0010] Waiting to retrieve agent configuration; server is not ready: "overlayfs" snapshotter cannot be enabled for "/var/lib/rancher/k3s/agent/containerd", try using "fuse-overlayfs" or "native": failed to mount overlay: operation not permitted
INFO[0011] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found
INFO[0012] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found
INFO[0013] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found
INFO[0014] Waiting for control-plane node 293d72719bcd startup: nodes "293d72719bcd" not found
I0315 19:03:10.389468 14 range_allocator.go:83] Sending events to api server.
/var/log/k3d-entrypoints_230315190251.log
looks fine:
[2023-03-15T19:02:51+00:00] Running k3d entrypoints...
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-cgroupv2.sh
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-flyte-sandbox-bootstrap.sh
[2023-03-15T19:02:51+00:00] Running /bin/k3d-entrypoint-gpu-check.sh
GPU Enabled - checking if it's available
Wed Mar 15 19:02:52 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:17:00.0 Off |                  N/A |
| 30%   44C    P0    91W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
nvidia-smi working
[2023-03-15T19:02:52+00:00] Finished k3d entrypoint scripts!
Björn
03/15/2023, 7:40 PM
docker run -it --gpus all --privileged --entrypoint bash --add-host host.docker.internal:host-gateway flyte-sandbox-gpu:latest
Nan Qin
03/15/2023, 7:44 PM
.PHONY: start
start: FLYTE_SANDBOX_IMAGE := flyte-sandbox-gpu:latest
start: FLYTE_DEV := False
start:
	[ -n "$(shell docker volume ls --filter name=^flyte-sandbox$$ --format {{.Name}})" ] || \
		docker volume create flyte-sandbox
	[ -n "$(shell docker ps --filter name=^flyte-sandbox$$ --format {{.Names}})" ] || \
		docker run -it --rm --privileged --name flyte-sandbox \
			--gpus all \
			--add-host "host.docker.internal:host-gateway" \
			--env FLYTE_DEV=$(FLYTE_DEV) \
			--env K3S_KUBECONFIG_OUTPUT=/.kube/kubeconfig \
			--volume $(PWD)/.kube:/.kube \
			--volume $(HOME)/.flyte/sandbox:/var/lib/flyte/config \
			--volume flyte-sandbox:/var/lib/flyte/storage \
			--publish "6443":"6443" \
			--publish "30000:30000" \
			--publish "30001:30001" \
			--publish "30002:30002" \
			--publish "30080:30080" \
			$(FLYTE_SANDBOX_IMAGE)
export KUBECONFIG=.kube/kubeconfig
and it is working with make start!
Björn
03/15/2023, 8:12 PM
flytectl demo start
didn't work then... 🤔
Nan Qin
03/15/2023, 8:26 PM
/var/log
alternatives.log apt bootstrap.log btmp dpkg.log faillog lastlog wtmp
do you want to take a look at any of these files?
make start?
make kubeconfig
doesn't do the magic as flytectl demo start
jeev
03/15/2023, 8:28 PM
Björn
03/15/2023, 8:28 PM
flytectl demo start --image X?
flytectl demo start --image flyte-sandbox-gpu:latest
works...
jeev
03/15/2023, 8:30 PM
Björn
03/15/2023, 8:31 PM
Nan Qin
03/15/2023, 8:42 PM
❇️ Run the following command to create new sandbox container
docker create --privileged -p 0.0.0.0:30000:30000 -p 0.0.0.0:30001:30001 -p 0.0.0.0:30002:30002 -p 0.0.0.0:6443:6443 -p 0.0.0.0:30080:30080 --env SANDBOX=1 --env KUBERNETES_API_PORT=30086 --env FLYTE_HOST=localhost:30081 --env FLYTE_AWS_ENDPOINT=http://localhost:30084 --env K3S_KUBECONFIG_OUTPUT=/var/lib/flyte/config/kubeconfig --mount type=bind,source=/home/nan/.flyte,target=/etc/rancher/ --mount type=bind,source=/home/nan/.flyte/sandbox,target=/var/lib/flyte/config --mount type=volume,source=flyte-sandbox,target=/var/lib/flyte/storage --name flyte-sandbox flyte-sandbox-gpu:latest
doesn't have --add-host "host.docker.internal:host-gateway"
jeev
03/15/2023, 8:43 PM
Nan Qin
03/15/2023, 8:50 PM
hmm that might just be a bug in rendering the line. is everything else the same?
the same as the make target?
jeev
03/15/2023, 8:51 PM
Nan Qin
03/15/2023, 8:59 PM
--add-host host.docker.internal:host-gateway --gpus all
and it works
jeev
03/15/2023, 8:59 PM
Björn
03/15/2023, 9:25 PM
docker inspect
the failed container from flytectl and check if it has a section such as this one
"ExtraHosts": [
    "host.docker.internal:host-gateway"
]
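The check described above can be scripted. This is only a minimal sketch: `has_extra_host` is a hypothetical helper (not part of Docker or flytectl), and the container name `flyte-sandbox` is assumed from the `docker create` command printed earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Docker or flytectl).
# $1: JSON list as printed by
#     docker inspect --format '{{json .HostConfig.ExtraHosts}}' <container>
# $2: the host entry to look for
# Succeeds if the entry appears as a quoted string in that list.
has_extra_host() {
  printf '%s' "$1" | grep -qF "\"$2\""
}

# Example usage against a real container (name assumed to be flyte-sandbox):
#   hosts=$(docker inspect --format '{{json .HostConfig.ExtraHosts}}' flyte-sandbox)
#   has_extra_host "$hosts" "host.docker.internal:host-gateway" \
#     && echo "ExtraHosts entry present" \
#     || echo "ExtraHosts entry missing"
```

The helper takes the JSON as an argument rather than calling Docker itself, so it can be tried on pasted `docker inspect` output as well.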
Nan Qin
03/15/2023, 9:53 PM
"ExtraHosts": [
    "host.docker.internal:host-gateway"
],
Björn
03/16/2023, 6:30 AM
--gpus all
... I realise now that I have a non-standard flag in my docker-file needed for gpus to be passed to docker build.
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
"default-runtime": "nvidia",
it doesn't start with the flytectl demo start
for me either.
docker info|grep -i runtime
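To see which runtime Docker uses by default, the `docker info` output mentioned above can be narrowed to its "Default Runtime" line. This is a rough sketch: `default_runtime` is a hypothetical helper, and it assumes `docker info` prints a line of the form `Default Runtime: <name>`.

```shell
#!/bin/sh
# Hypothetical helper: reads `docker info` text on stdin and prints the
# value of its "Default Runtime" line, if there is one.
default_runtime() {
  awk -F': ' 'tolower($0) ~ /default runtime/ { print $2 }'
}

# Example usage on a host with Docker installed:
#   if [ "$(docker info 2>/dev/null | default_runtime)" = "nvidia" ]; then
#     echo "nvidia is the default runtime"
#   else
#     echo "default runtime is not nvidia"
#   fi
```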
jeev
03/16/2023, 6:54 AM
Björn
03/16/2023, 7:18 AM
Nan Qin
03/16/2023, 2:44 PM
flyte-admin-base-config.
Below are all the cms in all namespaces:
kube-system extension-apiserver-authentication 6 3h52m
kube-system cluster-dns 2 3h52m
flyte flyte-sandbox-cluster-resource-templates 1 3h52m
flyte flyte-sandbox-config 5 3h52m
flyte flyte-sandbox-docker-registry-config 1 3h52m
flyte flyte-sandbox-extra-cluster-resource-templates 0 3h52m
flyte flyte-sandbox-extra-config 0 3h52m
flyte flyte-sandbox-proxy-config 1 3h52m
flyte kubernetes-dashboard-settings 0 3h52m
kube-system chart-content-nvidia-device-plugin 0 3h52m
kube-system chart-values-nvidia-device-plugin 0 3h52m
kube-system local-path-config 4 3h52m
flyte kube-root-ca.crt 1 3h52m
kube-system kube-root-ca.crt 1 3h52m
default kube-root-ca.crt 1 3h52m
kube-public kube-root-ca.crt 1 3h52m
kube-node-lease kube-root-ca.crt 1 3h52m
kube-system coredns 2 3h52m
flytesnacks-development kube-root-ca.crt 1 3h51m
flytesnacks-staging kube-root-ca.crt 1 3h51m
flytesnacks-production kube-root-ca.crt 1 3h51m
jeev
03/17/2023, 11:51 PM
task_resources:
  defaults:
    cpu: 1
    memory: 2Gi
  limits:
    cpu: 4
    memory: 8Gi
or equivalent to ~/.flyte/sandbox/config.yaml
and run flytectl demo reload
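The two steps above could be scripted roughly as follows. This is a sketch only: `append_task_resources` is a hypothetical helper, and it assumes the config file does not already contain a `task_resources` key (appending a second one would duplicate it).

```shell
#!/bin/sh
# Hypothetical helper: appends the task_resources block from the message
# above to a sandbox config file, creating parent directories if needed.
# $1: path to the config file (the thread uses ~/.flyte/sandbox/config.yaml).
# Assumption: the file has no existing task_resources key.
append_task_resources() {
  mkdir -p "$(dirname "$1")"
  cat >> "$1" <<'EOF'
task_resources:
  defaults:
    cpu: 1
    memory: 2Gi
  limits:
    cpu: 4
    memory: 8Gi
EOF
}

# Example usage, followed by the reload step from the thread:
#   append_task_resources "$HOME/.flyte/sandbox/config.yaml"
#   flytectl demo reload
```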
Nan Qin
03/18/2023, 2:19 AM
overlay 1.8T 932G 808G 54% /
tmpfs 64M 0 64M 0% /dev
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/nvme0n1p2 1.8T 932G 808G 54% /etc/hosts
shm 64M 8.0K 64M 1% /dev/shm
tmpfs 32G 12K 32G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 32G 12K 32G 1% /proc/driver/nvidia
tmpfs 32G 0 32G 0% /proc/acpi
tmpfs 32G 0 32G 0% /proc/scsi
tmpfs 32G 0 32G 0% /sys/firmware
jeev
03/18/2023, 3:12 AM
Nan Qin
03/18/2023, 3:40 AM
010-inline-config.yaml: |
  plugins:
    k8s:
      default-env-vars:
      - FLYTE_AWS_ENDPOINT: http://flyte-sandbox-minio.flyte:9000
      - FLYTE_AWS_ACCESS_KEY_ID: minio
      - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
      default-pod-template-name: flyte-template
2. kubectl apply -f podTemplate.yaml which is
apiVersion: v1
kind: PodTemplate
metadata:
  name: flyte-template
  namespace: flyte
template:
  spec:
    volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: 64000Mi
    containers:
      - name: default
        image: docker.io/rwgrim/docker-noop
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        env:
          - name: FOO
            value: BAR
3. wait for a few mins and start a workflow.
Neither the volume nor the env vars are in the task container.
jeev
03/18/2023, 3:59 AM
Nan Qin
03/18/2023, 4:00 AM
Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist
however I have the podTemplate in both flyte and flytesnacks-development ns
jeev
03/18/2023, 4:00 AM
Nan Qin
03/18/2023, 4:01 AM
~/.flyte/sandbox/config.yaml?
jeev
03/18/2023, 4:01 AM
Nan Qin
03/18/2023, 4:02 AM
task_resources:
  defaults:
    cpu: 1
    memory: 2Gi
    storage: 32Gi
  limits:
    cpu: 8
    memory: 128Gi
    storage: 512Gi
inline:
  plugins:
    k8s:
      default-pod-template-name: flyte-template
jeev
03/18/2023, 4:02 AM
Nan Qin
03/18/2023, 4:03 AM
Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist
NAMESPACE NAME CONTAINERS IMAGES POD LABELS
flyte flyte-template default docker.io/rwgrim/docker-noop <none>
flytesnacks-development flyte-template default docker.io/rwgrim/docker-noop <none>
jeev
03/18/2023, 4:05 AM
Nan Qin
03/18/2023, 4:05 AM
jeev
03/18/2023, 4:05 AM
Nan Qin
03/18/2023, 4:07 AM
Workflow[flytesnacks:development:workflows.workflow.baby_training_wf] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [BadTaskSpecification] PodTemplate 'flyte-template' does not exist
Brian Tang
04/03/2023, 7:09 AM