Dan Farrell
10/31/2023, 7:47 PM
E1031 19:35:27.628662 66 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown" pod="flyte/flyte-sandbox-postgresql-0"
E1031 19:35:27.628687 66 kuberuntime_manager.go:1166] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown" pod="flyte/flyte-sandbox-postgresql-0"
E1031 19:35:27.628758 66 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"flyte-sandbox-postgresql-0_flyte(43c0a74e-0d0d-46b9-b4e5-50d76ca102d7)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"flyte-sandbox-postgresql-0_flyte(43c0a74e-0d0d-46b9-b4e5-50d76ca102d7)\\\": rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown\"" pod="flyte/flyte-sandbox-postgresql-0" podUID="43c0a74e-0d0d-46b9-b4e5-50d76ca102d7"
W1031 19:35:28.620888 66 manager.go:1159] Failed to process watch event {EventType:0 Name:/kubepods/besteffort/pod174bfdd2-289e-45f4-8d89-3420e1fe3835/668db6ca7463b2bc7452a10b05dd6e48d41dc989cfb6b0771b958996958c76f2 WatchSource:0}: container "668db6ca7463b2bc7452a10b05dd6e48d41dc989cfb6b0771b958996958c76f2" in namespace "k8s.io": not found
I also have this error at the top of my logs:
sed: couldn't flush stdout: Device or resource busy
time="2023-10-31T19:41:12Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
time="2023-10-31T19:41:12Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/ab2055bc72380bad965b219e8688ac02b2e1b665cad6bdde1f8f087637aa81df"
time="2023-10-31T19:41:15Z" level=info msg="Starting k3s v1.28.2+k3s1 (6330a5b4)"
Does anyone have any idea why this line might be failing? https://github.com/flyteorg/flyte/blob/master/docker/sandbox-bundled/bin/k3d-entrypoint-cgroupv2.sh#L19
L godlike
11/01/2023, 2:14 AM
Dan Farrell
11/01/2023, 2:18 AM
L godlike
11/01/2023, 2:19 AM
Dan Farrell
11/01/2023, 2:31 AM
> pyflyte run --remote runme.py check_if_gpu_available
Running Execution on Remote.
Failed with Exception Code: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.UNAVAILABLE
details: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
Debug string UNKNOWN:Error received from peer {grpc_message:"upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111", grpc_status:14, created_time:"2023-11-01T02:31:10.281095205+00:00"}
L godlike
11/01/2023, 2:32 AM
pyflyte register runme.py
pip install flytekitplugins-envd
?
Dan Farrell
11/01/2023, 2:33 AM
L godlike
11/01/2023, 2:33 AM
pyflyte register runme.py
Dan Farrell
11/01/2023, 2:34 AM
>kubectl get po -A
NAMESPACE     NAME                                                  READY   STATUS      RESTARTS   AGE
flyte         flyte-sandbox-kubernetes-dashboard-6449889c8d-z6rl2   1/1     Running     0          46m
flyte         flyte-sandbox-docker-registry-6f977cf857-m7tj2        1/1     Running     0          46m
kube-system   coredns-59b4f5bbd5-9kwkl                              1/1     Running     0          46m
flyte         flyte-sandbox-proxy-6dc7cf6fbb-jmdjd                  1/1     Running     0          46m
kube-system   nvidia-device-plugin-daemonset-86hlw                  1/1     Running     0          46m
kube-system   local-path-provisioner-76d776f6f9-6gcrd               1/1     Running     0          46m
kube-system   helm-install-nvidia-device-plugin-bm2rv               0/1     Completed   0          46m
kube-system   nvidia-device-plugin-nsj96                            1/1     Running     0          46m
flyte         flyte-sandbox-postgresql-0                            1/1     Running     0          46m
kube-system   metrics-server-7b67f64457-8hsp9                       1/1     Running     0          46m
flyte         flyte-sandbox-minio-699885976d-kwwkp                  1/1     Running     0          46m
flyte         flyte-sandbox-buildkit-d55d5f857-d8kfn                1/1     Running     0          46m
default       nvidia-smi                                            0/1     Pending     0          25m
flyte         flyte-sandbox-6b6f9c7d7f-z8qjn                        0/1     Pending     0          22m
>pyflyte register runme.py
Running pyflyte register from /home/dan.farrell/git/flyte/docker/sandbox-bundled with images ImageConfig(default_image=Image(name='default', fqn='cr.flyte.org/flyteorg/flytekit', tag='py3.11-1.10.0'), images=[Image(name='default', fqn='cr.flyte.org/flyteorg/flytekit', tag='py3.11-1.10.0')]) and image destination folder /root on 1 package(s) ('/home/dan.farrell/git/flyte/docker/sandbox-bundled/runme.py',)
Registering against localhost:30080
Detected Root /home/dan.farrell/git/flyte/docker/sandbox-bundled, using this to create deployable package...
No output path provided, using a temporary directory at /tmp/tmpspec9asa instead
Failed with Exception Code: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.UNAVAILABLE
details: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
Debug string UNKNOWN:Error received from peer {created_time:"2023-11-01T02:33:04.364020811+00:00", grpc_status:14, grpc_message:"upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111"}
L godlike
11/01/2023, 2:34 AM
flyte flyte-sandbox-6b6f9c7d7f-z8qjn 0/1 Pending 0 22m
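The "delayed connect error: 111" in the register output is a refused connection (errno 111 is ECONNREFUSED on Linux), which is consistent with the flyte-sandbox pod still being Pending: nothing is serving behind the proxy yet. A minimal Python sketch for spotting such pods in `kubectl get po -A` output follows. The function name `unready_pods` is made up for illustration, and the sketch assumes the default six-column layout seen in this thread.

```python
def unready_pods(kubectl_output: str) -> list[tuple[str, str, str]]:
    """Return (namespace, name, status) for pods that are neither Running nor Completed."""
    unready = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full pod row.
        if len(fields) < 6 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, status = fields[:4]
        if status not in ("Running", "Completed"):
            unready.append((namespace, name, status))
    return unready


# A few rows copied from the output posted in this thread.
sample = """\
NAMESPACE NAME READY STATUS RESTARTS AGE
flyte flyte-sandbox-postgresql-0 1/1 Running 0 46m
kube-system helm-install-nvidia-device-plugin-bm2rv 0/1 Completed 0 46m
default nvidia-smi 0/1 Pending 0 25m
flyte flyte-sandbox-6b6f9c7d7f-z8qjn 0/1 Pending 0 22m
"""

for namespace, name, status in unready_pods(sample):
    print(f"{namespace}/{name}: {status}")
```

For a pod reported this way, `kubectl describe pod <name> -n <namespace>` normally lists the scheduling reason under its events.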
Dan Farrell
11/01/2023, 2:35 AM
L godlike
11/01/2023, 2:35 AM
Dan Farrell
11/01/2023, 2:35 AM
L godlike
11/01/2023, 2:35 AM
Dan Farrell
11/01/2023, 2:37 AM
L godlike
11/01/2023, 2:38 AM
Dan Farrell
11/01/2023, 2:38 AM
L godlike
11/01/2023, 2:38 AM
kubectl describe node | grep -i gpu
Dan Farrell
11/01/2023, 2:40 AM
nvidia.com/gpu: 2
nvidia.com/gpu: 2
nvidia.com/gpu 0 0
L godlike
11/01/2023, 2:40 AM
Dan Farrell
11/01/2023, 2:41 AM
L godlike
11/01/2023, 2:41 AM
Dan Farrell
11/01/2023, 2:45 AM
L godlike
11/01/2023, 2:46 AM
future-outlier on github.
kube-system nvidia-device-plugin-daemonset-86hlw 1/1 Running 0 46m
Dan Farrell
11/01/2023, 3:06 AM
[4/4] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
[fe825dcc166ff4208870-fnoahrfq-3] terminated with exit code (127). Reason [Error]. Message:
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
==========
== CUDA ==
==========
CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: pyflyte-fast-execute: not found
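Exit code 127 is the shell's "command not found" status, and the last log line says the image's entrypoint could not find `pyflyte-fast-execute`, which suggests flytekit is not installed in that CUDA image. A small sketch for confirming this from inside a candidate task image follows. The helper name `missing_entrypoints` is hypothetical, and `pyflyte-execute` is listed on the assumption that flytekit installs it alongside `pyflyte-fast-execute`.

```python
import shutil


def missing_entrypoints(commands: list[str]) -> list[str]:
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    # pyflyte-fast-execute is the command named in the error above;
    # pyflyte-execute is an assumed companion entrypoint from flytekit.
    missing = missing_entrypoints(["pyflyte-execute", "pyflyte-fast-execute"])
    print("missing:", missing or "none")
```

If either command is reported missing, installing flytekit into the image (for example with `pip install flytekit`) is the likely fix.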
L godlike
11/01/2023, 3:06 AM
Dan Farrell
11/01/2023, 3:07 AM
L godlike
11/01/2023, 3:07 AM
pyflyte run --remote --image pingsutw/flytekit:dbPeB53UK_5Lz_mh7s4CpA.. check_gpu.py check_if_gpu_available
Dan Farrell
11/01/2023, 3:08 AM
L godlike
11/01/2023, 3:09 AM
Dan Farrell
11/01/2023, 3:11 AM
L godlike
11/01/2023, 3:12 AM
Dan Farrell
11/01/2023, 3:12 AM
L godlike
11/01/2023, 3:12 AM
Dan Farrell
11/01/2023, 3:13 AM
L godlike
11/01/2023, 3:13 AM
Dan Farrell
11/01/2023, 3:14 AM
L godlike
11/01/2023, 3:15 AM
FROM python:3.9-slim-buster
USER root
WORKDIR /root
ENV PYTHONPATH=/root
ENV CUDA_VERSION=12.2
ENV PYTHON_VERSION=3.9.13
RUN apt-get update && apt-get install build-essential -y
RUN apt-get install git -y
# The following line is an example of how to install your modified plugins.
# RUN pip install -U "git+https://github.com/Yicheng-Lu-llll/flytekit.git@demo#egg=flytekitplugins-deck-standard&subdirectory=plugins/flytekit-deck-standard" # replace with your own repo and branch
RUN pip install flytekit==1.10.0
RUN pip install torch
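The thread never shows the contents of runme.py or check_gpu.py, so the following is only a guess at what a `check_if_gpu_available` function could look like once an image with flytekit and torch is in use. In a real Flyte project it would presumably be decorated with flytekit's `@task` and request a GPU. Both are left out here so the sketch runs standalone.

```python
def check_if_gpu_available() -> bool:
    """Return True if PyTorch can see at least one CUDA device."""
    try:
        import torch  # provided by `pip install torch` in the image
    except ImportError:
        # Without torch in this environment, no GPU can be reported.
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("GPU available:", check_if_gpu_available())
```

Running the file directly inside the built image is a quick way to test GPU visibility before registering anything with Flyte.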
Dan Farrell
11/01/2023, 3:20 AM
L godlike
11/01/2023, 3:21 AM
Dan Farrell
11/01/2023, 3:22 AM
L godlike
11/01/2023, 3:22 AM
Dan Farrell
11/01/2023, 3:22 AM
L godlike
11/01/2023, 3:22 AM
Dan Farrell
11/01/2023, 3:23 AM
L godlike
11/01/2023, 3:55 AM