# contribute
d
I was evaluating this PR: https://github.com/flyteorg/flyte/pull/3256 but when I run flytectl demo start .... I get a lot of errors:
E1031 19:35:27.628662      66 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown" pod="flyte/flyte-sandbox-postgresql-0"
E1031 19:35:27.628687      66 kuberuntime_manager.go:1166] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown" pod="flyte/flyte-sandbox-postgresql-0"
E1031 19:35:27.628758      66 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"flyte-sandbox-postgresql-0_flyte(43c0a74e-0d0d-46b9-b4e5-50d76ca102d7)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"flyte-sandbox-postgresql-0_flyte(43c0a74e-0d0d-46b9-b4e5-50d76ca102d7)\\\": rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown\"" pod="flyte/flyte-sandbox-postgresql-0" podUID="43c0a74e-0d0d-46b9-b4e5-50d76ca102d7"
W1031 19:35:28.620888      66 manager.go:1159] Failed to process watch event {EventType:0 Name:/kubepods/besteffort/pod174bfdd2-289e-45f4-8d89-3420e1fe3835/668db6ca7463b2bc7452a10b05dd6e48d41dc989cfb6b0771b958996958c76f2 WatchSource:0}: container "668db6ca7463b2bc7452a10b05dd6e48d41dc989cfb6b0771b958996958c76f2" in namespace "k8s.io": not found
I also have this error at the top of my logs:
sed: couldn't flush stdout: Device or resource busy
time="2023-10-31T19:41:12Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
time="2023-10-31T19:41:12Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/ab2055bc72380bad965b219e8688ac02b2e1b665cad6bdde1f8f087637aa81df"
time="2023-10-31T19:41:15Z" level=info msg="Starting k3s v1.28.2+k3s1 (6330a5b4)"
Does anyone have any idea why https://github.com/flyteorg/flyte/blob/master/docker/sandbox-bundled/bin/k3d-entrypoint-cgroupv2.sh#L19 this line might be failing?
l
Will check this later, Sorry for the late reply
d
i have it almost figured out @L godlike
l
Thanks a lot, please share how it works and provide screenshots
If there's anything I can help with, please tell me. Really appreciated
d
@L godlike
> pyflyte run --remote runme.py  check_if_gpu_available
Running Execution on Remote.
Failed with Exception Code: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.UNAVAILABLE
        details: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
        Debug string UNKNOWN:Error received from peer  {grpc_message:"upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111", grpc_status:14, created_time:"2023-11-01T02:31:10.281095205+00:00"}
is there any configuration of envd, flyte, or kubectl that I need to do so pyflyte run can see it?
ohh
l
Can you also try pyflyte register runme.py?
Did you run pip install flytekitplugins-envd?
Did you start your flyte cluster?
d
same thing, yes it is installed, yes flyte cluster is started
l
Please give me the log from
pyflyte register runme.py
d
>kubectl get po -A 
NAMESPACE     NAME                                                  READY   STATUS      RESTARTS   AGE
flyte         flyte-sandbox-kubernetes-dashboard-6449889c8d-z6rl2   1/1     Running     0          46m
flyte         flyte-sandbox-docker-registry-6f977cf857-m7tj2        1/1     Running     0          46m
kube-system   coredns-59b4f5bbd5-9kwkl                              1/1     Running     0          46m
flyte         flyte-sandbox-proxy-6dc7cf6fbb-jmdjd                  1/1     Running     0          46m
kube-system   nvidia-device-plugin-daemonset-86hlw                  1/1     Running     0          46m
kube-system   local-path-provisioner-76d776f6f9-6gcrd               1/1     Running     0          46m
kube-system   helm-install-nvidia-device-plugin-bm2rv               0/1     Completed   0          46m
kube-system   nvidia-device-plugin-nsj96                            1/1     Running     0          46m
flyte         flyte-sandbox-postgresql-0                            1/1     Running     0          46m
kube-system   metrics-server-7b67f64457-8hsp9                       1/1     Running     0          46m
flyte         flyte-sandbox-minio-699885976d-kwwkp                  1/1     Running     0          46m
flyte         flyte-sandbox-buildkit-d55d5f857-d8kfn                1/1     Running     0          46m
default       nvidia-smi                                            0/1     Pending     0          25m
flyte         flyte-sandbox-6b6f9c7d7f-z8qjn                        0/1     Pending     0          22m
>pyflyte register runme.py
Running pyflyte register from /home/dan.farrell/git/flyte/docker/sandbox-bundled with images ImageConfig(default_image=Image(name='default', fqn='cr.flyte.org/flyteorg/flytekit', tag='py3.11-1.10.0'), images=[Image(name='default', fqn='cr.flyte.org/flyteorg/flytekit', tag='py3.11-1.10.0')]) and image destination folder /root on 1 package(s) ('/home/dan.farrell/git/flyte/docker/sandbox-bundled/runme.py',)
Registering against localhost:30080
Detected Root /home/dan.farrell/git/flyte/docker/sandbox-bundled, using this to create deployable package...
No output path provided, using a temporary directory at /tmp/tmpspec9asa instead
Failed with Exception Code: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.UNAVAILABLE
        details: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
        Debug string UNKNOWN:Error received from peer  {created_time:"2023-11-01T02:33:04.364020811+00:00", grpc_status:14, grpc_message:"upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111"}
l
Why are these 2 pending?
I think the reason is this pod:
flyte         flyte-sandbox-6b6f9c7d7f-z8qjn                        0/1     Pending     0          22m
d
oh
let me see
l
You have to wait for this pod to start. If it doesn't, there's something wrong in the pod
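To spot pods like this quickly, one option is to filter the kubectl get po -A output for anything that is not Running or Completed. This is only a rough Python sketch, assuming the default column layout shown in the listing above; the function name not_ready_pods and the SAMPLE text are made up for illustration:

```python
from typing import List, Tuple

# Hypothetical sample: a few rows copied from the `kubectl get po -A` listing above.
SAMPLE = """\
NAMESPACE     NAME                                                  READY   STATUS      RESTARTS   AGE
flyte         flyte-sandbox-proxy-6dc7cf6fbb-jmdjd                  1/1     Running     0          46m
kube-system   helm-install-nvidia-device-plugin-bm2rv               0/1     Completed   0          46m
default       nvidia-smi                                            0/1     Pending     0          25m
flyte         flyte-sandbox-6b6f9c7d7f-z8qjn                        0/1     Pending     0          22m
"""


def not_ready_pods(kubectl_output: str) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, status) for pods that are neither Running nor Completed."""
    result = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that does not look like a pod row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, status = fields[:4]
        if status not in ("Running", "Completed"):
            result.append((namespace, name, status))
    return result


if __name__ == "__main__":
    for namespace, name, status in not_ready_pods(SAMPLE):
        print(f"{namespace}/{name} is {status}")
```

In practice the text would come from the real kubectl command rather than a pasted string. While flyte-sandbox is in this list, pyflyte register and pyflyte run are expected to fail with the connection error shown above.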
d
oh it is from my taint
1 sec
l
And the root cause might be the image or the GPU settings
No problem
d
🎉
thank you, sorry was too sleepy to read
l
Did it succeed?
d
i made a lot of changes, I am sorry this may be confusing
it is building
l
Does the node have a GPU?
I mean
kubectl describe node | grep -i gpu
Can you provide this information?
d
yes it does
nvidia.com/gpu:     2
  nvidia.com/gpu:     2
  nvidia.com/gpu     0           0
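For reading those three lines: I am assuming the two lines with a colon come from the Capacity and Allocatable sections of kubectl describe node, and the last line is the Allocated resources row (requests, then limits). That mapping is my reading of the usual describe output, not something confirmed in this thread. A small Python sketch that pulls the numbers out, with the hypothetical name summarize_gpu_lines:

```python
from typing import List, Optional, Tuple

# Hypothetical sample: the grep output pasted above.
SAMPLE = """\
nvidia.com/gpu:     2
  nvidia.com/gpu:     2
  nvidia.com/gpu     0           0
"""


def summarize_gpu_lines(grep_output: str) -> Tuple[List[int], Optional[Tuple[int, int]]]:
    """Split the grep output into 'key: count' values and the 'requests limits' row.

    Assumption: lines of the form 'nvidia.com/gpu: N' are Capacity/Allocatable,
    and a line of the form 'nvidia.com/gpu R L' is the Allocated resources row.
    """
    counts: List[int] = []
    allocated: Optional[Tuple[int, int]] = None
    for line in grep_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("nvidia.com/gpu"):
            continue
        if fields[0].endswith(":") and len(fields) >= 2:
            counts.append(int(fields[1]))
        elif len(fields) >= 3:
            allocated = (int(fields[1]), int(fields[2]))
    return counts, allocated


if __name__ == "__main__":
    counts, allocated = summarize_gpu_lines(SAMPLE)
    print("GPU counts reported by the node:", counts)
    print("GPUs currently requested/limited:", allocated)
```

Under that assumption, the node advertises 2 GPUs and none are requested by any pod yet.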
l
Can you provide screenshots?
I can paste them into the PR description
d
i can do that
l
Really thanks a lot
Is there anything wrong with the settings?
I mean the Guidance
I've added this yesterday.
d
i will post a diff
l
Thanks a lot
Love you
Please mention me when you finish, I am future-outlier on GitHub.
I will help you review and mention other maintainers
Really appreciate
@Dan Farrell 1. Can you provide the screenshots after the task succeeded? 2. Can you clarify whether this pod is necessary or not?
kube-system   nvidia-device-plugin-daemonset-86hlw                  1/1     Running     0          46m
Thanks a lot
d
the job fails with:
[4/4] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.

[fe825dcc166ff4208870-fnoahrfq-3] terminated with exit code (127). Reason [Error]. Message: 
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

/opt/nvidia/nvidia_entrypoint.sh: line 67: exec: pyflyte-fast-execute: not found
seems envd is doing something off
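Exit code 127 is the shell's "command not found" code, which matches the last log line: the container started, but pyflyte-fast-execute was not on its PATH. A quick check that could be run inside the built image is sketched below in Python. The function name missing_executables is made up, and the list of entrypoints is my assumption about which console scripts flytekit installs:

```python
import shutil
from typing import Iterable, List

# Assumed flytekit entrypoints; adjust to whatever the task command actually calls.
EXPECTED = ["pyflyte-execute", "pyflyte-fast-execute"]


def missing_executables(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be resolved on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables(EXPECTED)
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All expected entrypoints found")
```

If the script reports pyflyte-fast-execute as missing, flytekit is probably not installed in the interpreter environment the image uses at runtime.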
l
Can you try this alternative?
d
i know what to do
l
pyflyte run --remote --image pingsutw/flytekit:dbPeB53UK_5Lz_mh7s4CpA..  check_gpu.py  check_if_gpu_available
Nice, wait for your reply
Appreciate
d
is it ok if I push onto your branch?
or do you want to try my changes on windows?
l
The PR is not my branch, so I have no permissions
I think you can create a PR in flyte.
If the author of the PR doesn't apply the diff, we can consider merging yours
I can't contact the author for 3 days
If you use ImageSpec, please remember to change your registry.
d
build is slow -_- seems like something is wrong with caching
l
You can use this
d
I will just make another PR, but make it a draft, because merging diffs is terrible
l
I'll give you a Dockerfile now.
Thanks a lot
d
do i add checksum/secret: XXXXX stuff to the PR?
for b/docker/sandbox-bundled/manifests/complete.yaml
l
The checksum
just ignore it
it is kind of a legacy, it doesn't affect anything
you can just push it
it won't affect anything in the repo
d
thought so
l
I will give you the Dockerfile in 2 minutes
please wait
FROM python:3.9-slim-buster
USER root
WORKDIR /root
# Use the key=value form of ENV; the space-separated form is legacy.
ENV PYTHONPATH=/root
ENV CUDA_VERSION=12.2
ENV PYTHON_VERSION=3.9.13

# Install build tools and git in one layer.
RUN apt-get update && apt-get install -y build-essential git
# The following line is an example of how to install your modified plugins.
# RUN pip install -U "git+https://github.com/Yicheng-Lu-llll/flytekit.git@demo#egg=flytekitplugins-deck-standard&subdirectory=plugins/flytekit-deck-standard" # replace with your own repo and branch
RUN pip install flytekit==1.10.0
RUN pip install torch
@Dan Farrell
d
l
Let me help you, I will give you some advice in the PR.
It will be great to provide more information to reviewers
Thx
d
image.png
l
Oh my god
d
Yes I will update it with your walkthrough
l
We did it
Love you......
It bothered me for 2 weeks
d
🎉
you made me not sleep for 2 days
i hate you
❤️
ok, I am done for the night. I think I will try and recreate this with a fresh VM tomorrow
l
Thanks a lot, please go get some rest
I will leave messages for you. Really appreciate it again