incalculable-ice-13425
08/08/2023, 12:12 PM
my_task = ContainerTask(
metadata=TaskMetadata(cache=True, cache_version="1.0"),
name="my-task",
image="my-image",
input_data_dir="/var/inputs",
output_data_dir="/var/outputs",
inputs=kwtypes(inDir=str),
outputs=kwtypes(out=str),
requests=Resources(gpu="1"),
limits=Resources(gpu="1"),
command=[
"/bin/bash",
],
arguments=[
"-c",
"echo \"out\" > /var/outputs/out; ... other commands"
],
....
)
I wanted to cache the task, and found that I had to declare inputs/outputs even though I don’t need them. So I just write the string “out” to /var/outputs/out, as shown in the arguments, and pass a string for inDir when calling the task, as below.
@workflow
def aeb_sanity_workflow(data: Dict):
## -----------------------------------------------------------------------------
.......
my_task_promise = my_task(inDir="some string")
........
This was working for me with the earlier version of Flyte mentioned below:
cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a
However, I now use the master version of Flyte patched with https://github.com/flyteorg/flyte/pull/3256 and manually built in docker/sandbox-bundled using make build-gpu, because I needed GPU support in the sandbox.
With this latest version, I see two issues which were not there with cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a (tag: v1.8.1):
1. For the above-mentioned ContainerTask, it throws an error saying the output doesn’t exist after workflow execution. I haven’t changed a single line of code in my-task, only the Flyte image.
2. Also, for the task that needs a GPU, since the image is huge (~24 GB), the k8s node came under disk pressure and several pods were evicted.
> kubectl describe pod <gpu-pod>
Warning Evicted 8m10s (x3 over 9m30s) kubelet The node was low on resource: ephemeral-storage.
Warning ExceededGracePeriod 8m (x3 over 9m20s) kubelet Container runtime did not kill the pod within specified grace period.
Normal Pulled 7m59s kubelet Successfully pulled image "my-gpu-image" in 8m40.26240502s
Normal Created 7m59s kubelet Created container primary
Normal Started 7m58s kubelet Started container primary
Normal Killing 7m58s kubelet Stopping container primary
Warning Evicted 7m30s kubelet The node was low on resource: ephemeral-storage. Container primary was using 13516Ki, which exceeds its request of 0.
> kubectl describe nodes <>
Warning FreeDiskSpaceFailed 52m kubelet failed to garbage collect required amount of images. Wanted to free 110758122291 bytes, but freed 155692522 bytes
Warning ImageGCFailed 52m kubelet failed to garbage collect required amount of images. Wanted to free 110758122291 bytes, but freed 155692522 bytes
Warning FreeDiskSpaceFailed 47m kubelet failed to garbage collect required amount of images. Wanted to free 111138763571 bytes, but freed 0 bytes
Warning ImageGCFailed 47m kubelet failed to garbage collect required amount of images. Wanted to free 111138763571 bytes, but freed 0 bytes
Warning EvictionThresholdMet 7m56s (x3 over 11m) kubelet Attempting to reclaim ephemeral-storage
Normal NodeNotReady 7m49s node-controller Node 1fefe346c083 status is now: NodeNotReady
Normal NodeHasSufficientMemory 7m47s (x3 over 57m) kubelet Node 1fefe346c083 status is now: NodeHasSufficientMemory
Normal NodeHasDiskPressure 7m47s (x2 over 11m) kubelet Node 1fefe346c083 status is now: NodeHasDiskPressure
I didn’t observe these issues with this image: cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a (tag: v1.8.1)
incalculable-ice-13425
08/08/2023, 12:14 PM
incalculable-ice-13425
08/08/2023, 12:34 PM
incalculable-ice-13425
08/08/2023, 12:34 PM
freezing-boots-56761
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 1:36 PM
1. make build-gpu
2. make start
I couldn’t use flytectl demo start --image <> because the container exits immediately. This is because --gpus all is required in docker run, which I manually added in the Makefile before calling make start.
This is what I found in kubectl describe node <> in this case:
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system nvidia-device-plugin-daemonset-45mbf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 11m
kube-system metrics-server-667586758d-8fflg 100m (1%) 0 (0%) 70Mi (0%) 0 (0%) 37m
kube-system coredns-7b5bbc6644-hsk6n 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 37m
kube-system local-path-provisioner-687d6d7765-dsgmm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 37m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 200m (2%) 0 (0%)
memory 140Mi (0%) 170Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
However, in the case of the normal sandbox cluster started with flytectl demo start, below was the node capacity:
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system local-path-provisioner-7b7dc8d6f5-lb2rx 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4m57s
flyte flyte-sandbox-kubernetes-dashboard-6757db879c-vfvnw 100m (1%) 2 (25%) 200Mi (0%) 200Mi (0%) 4m57s
kube-system coredns-b96499967-h5np6 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 4m57s
kube-system metrics-server-668d979685-5dvtl 100m (1%) 0 (0%) 70Mi (0%) 0 (0%) 4m57s
flyte flyte-sandbox-docker-registry-6494d7666-mpq9t 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4m57s
flyte flyte-sandbox-proxy-d95874857-v5vrh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4m57s
flyte flyte-sandbox-postgresql-0 250m (3%) 0 (0%) 256Mi (0%) 0 (0%) 4m57s
flyte flyte-sandbox-minio-645c8ddf7c-h5wbn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4m57s
flyte flyte-sandbox-69c7f848db-g9psq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4m57s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 550m (6%) 2 (25%)
memory 596Mi (0%) 370Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
I can see that there is a lot of difference in the node capacities between the two cases, which is causing the 1st one to crash because of the disk issue.
incalculable-ice-13425
08/08/2023, 1:38 PM
--gpus all
to docker run
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 1:41 PM
flytectl demo start --image <> is not working. I have the docker config with NVIDIA runtime only. We need to add --gpus all to docker run, otherwise the sandbox will not have access to the GPU, and that’s why it’s crashing immediately. Another user in the patch comments also faced the same issue.
incalculable-ice-13425
08/08/2023, 1:42 PM
For --gpus all to work, we must have the docker config with NVIDIA runtime, so that part is correct. The only missing part is calling docker run --gpus all with the flytectl demo start --image <> cmd.
freezing-boots-56761
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 1:43 PM
hmm i see. ok. w.r.t. the capacities, it looks like the difference is just due to a missing Flyte namespace.
Look at the Allocated resources: section; even that is less for the 1st case (make start).
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 1:44 PM
is that because the Flyte pods are crashlooping?
It’s happening at the stage when the pod pulls the ~24 GB image, when the node starts falling short of ephemeral storage. It’s working fine in the normal case (flytectl demo start). Tried multiple times.
freezing-boots-56761
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 1:46 PM
incalculable-ice-13425
08/08/2023, 1:47 PM
make start
or pass the --gpus all flag when calling flytectl demo start --image <>. I think either of them will solve the issue.
freezing-boots-56761
freezing-boots-56761
freezing-boots-56761
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 2:00 PM
https://github.com/flyteorg/flyte/pull/3256#issuecomment-1472279870 doesn’t work?
Yes, tried multiple times, doesn’t work. I’ve patched his changes onto the master repo, then ran make build-gpu and make start.
Note that I had to remove manifests-gpu from build-gpu for make build-gpu to produce a docker image.
incalculable-ice-13425
08/08/2023, 2:01 PM
kubectl describe nodes 8bb90a7d22d8
Name: 8bb90a7d22d8
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
egress.k3s.io/cluster=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=8bb90a7d22d8
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=true
node-role.kubernetes.io/master=true
node.kubernetes.io/instance-type=k3s
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"e6:0c:35:ea:3b:5b"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 172.17.0.2
k3s.io/hostname: 8bb90a7d22d8
k3s.io/internal-ip: 172.17.0.2
k3s.io/node-args: ["server","--disable","traefik","--disable","servicelb"]
k3s.io/node-config-hash: DTM2Y77ISYLRTGPA5HIDT5VGTAXMZ5BQ5HF4OYULEA4KHR2EII4A====
k3s.io/node-env: {"K3S_KUBECONFIG_OUTPUT":"/var/lib/flyte/config/kubeconfig"}
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 08 Aug 2023 19:22:27 +0530
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: 8bb90a7d22d8
AcquireTime: <unset>
RenewTime: Tue, 08 Aug 2023 19:24:45 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 08 Aug 2023 19:22:58 +0530 Tue, 08 Aug 2023 19:22:26 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 08 Aug 2023 19:22:58 +0530 Tue, 08 Aug 2023 19:22:26 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 08 Aug 2023 19:22:58 +0530 Tue, 08 Aug 2023 19:22:26 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 08 Aug 2023 19:22:58 +0530 Tue, 08 Aug 2023 19:22:37 +0530 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.17.0.2
Hostname: 8bb90a7d22d8
Capacity:
cpu: 8
ephemeral-storage: 944801904Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65774996Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 919103291491
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65774996Ki
pods: 110
System Info:
Machine ID:
System UUID: 324ab640-d7da-11dd-b4b7-b06ebfc7723f
Boot ID: fe7b217c-8f61-4568-93b0-a897170f1db9
Kernel Version: 5.15.0-78-generic
OS Image: K3s dev
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.6-k3s1
Kubelet Version: v1.24.4+k3s1
Kube-Proxy Version: v1.24.4+k3s1
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
ProviderID: k3s://8bb90a7d22d8
Non-terminated Pods: (10 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system coredns-b96499967-8lz84 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 3m30s
kube-system local-path-provisioner-7b7dc8d6f5-q2qtw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m30s
flyte flyte-sandbox-kubernetes-dashboard-6757db879c-txvl5 100m (1%) 2 (25%) 200Mi (0%) 200Mi (0%) 3m30s
kube-system metrics-server-668d979685-7xl5k 100m (1%) 0 (0%) 70Mi (0%) 0 (0%) 3m30s
flyte flyte-sandbox-docker-registry-7ddfcc58ff-zvlvj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m30s
flyte flyte-sandbox-proxy-d95874857-4g2p7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m30s
flyte flyte-sandbox-postgresql-0 250m (3%) 0 (0%) 256Mi (0%) 0 (0%) 3m30s
flyte flyte-sandbox-minio-645c8ddf7c-wkgdk 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m30s
flyte flyte-sandbox-buildkit-7d7d55dbb-kh949 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m30s
flyte flyte-sandbox-98749fb56-bsvsw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3m30s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 550m (6%) 2 (25%)
memory 596Mi (0%) 370Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 2m22s kube-proxy
Normal Starting 2m24s kubelet Starting kubelet.
Warning InvalidDiskCapacity 2m24s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 2m24s (x2 over 2m24s) kubelet Node 8bb90a7d22d8 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 2m24s (x2 over 2m24s) kubelet Node 8bb90a7d22d8 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 2m24s (x2 over 2m24s) kubelet Node 8bb90a7d22d8 status is now: NodeHasSufficientPID
Normal Synced 2m23s cloud-node-controller Node synced successfully
Normal NodeAllocatableEnforced 2m23s kubelet Updated Node Allocatable limit across pods
Normal RegisteredNode 2m20s node-controller Node 8bb90a7d22d8 event: Registered Node 8bb90a7d22d8 in Controller
Normal NodeReady 2m13s kubelet Node 8bb90a7d22d8 status is now: NodeReady
incalculable-ice-13425
08/08/2023, 2:02 PM
make start
kubectl describe nodes 9a694765b4ca
Name: 9a694765b4ca
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=k3s
beta.kubernetes.io/os=linux
egress.k3s.io/cluster=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=9a694765b4ca
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=true
node-role.kubernetes.io/master=true
node.kubernetes.io/instance-type=k3s
Annotations: flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"22:24:c9:c3:a8:f8"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 172.17.0.2
k3s.io/hostname: 9a694765b4ca
k3s.io/internal-ip: 172.17.0.2
k3s.io/node-args: ["server","--disable","traefik","--disable","servicelb"]
k3s.io/node-config-hash: 4J5AJPVISIGUNB3TJQ23D56IPSX5PBUDZLHDT7KGRNZTZCZG2KXA====
k3s.io/node-env:
{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/cc73cc54f96e096349faa656c251469ebefa4f9ab0d3f356ea6895ff145dcd1e","K3S_KUBECONFIG_OUTPUT":"/.ku...
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 08 Aug 2023 17:59:08 +0530
Taints: node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: 9a694765b4ca
AcquireTime: <unset>
RenewTime: Tue, 08 Aug 2023 18:35:39 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 08 Aug 2023 18:35:11 +0530 Tue, 08 Aug 2023 18:29:23 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Tue, 08 Aug 2023 18:35:11 +0530 Tue, 08 Aug 2023 18:31:57 +0530 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Tue, 08 Aug 2023 18:35:11 +0530 Tue, 08 Aug 2023 18:29:23 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 08 Aug 2023 18:35:11 +0530 Tue, 08 Aug 2023 18:29:23 +0530 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.17.0.2
Hostname: 9a694765b4ca
Capacity:
cpu: 8
ephemeral-storage: 944801904Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65774996Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 919103291491
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65774996Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID:
System UUID: 324ab640-d7da-11dd-b4b7-b06ebfc7723f
Boot ID: fe7b217c-8f61-4568-93b0-a897170f1db9
Kernel Version: 5.15.0-78-generic
OS Image: Ubuntu 20.04.6 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.12-k3s1
Kubelet Version: v1.24.9+k3s1
Kube-Proxy Version: v1.24.9+k3s1
PodCIDR: 10.42.0.0/24
PodCIDRs: 10.42.0.0/24
ProviderID: k3s://9a694765b4ca
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system nvidia-device-plugin-daemonset-45mbf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 11m
kube-system metrics-server-667586758d-8fflg 100m (1%) 0 (0%) 70Mi (0%) 0 (0%) 37m
kube-system coredns-7b5bbc6644-hsk6n 100m (1%) 0 (0%) 70Mi (0%) 170Mi (0%) 37m
kube-system local-path-provisioner-687d6d7765-dsgmm 0 (0%) 0 (0%) 0 (0%) 0 (0%) 37m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 200m (2%) 0 (0%)
memory 140Mi (0%) 170Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 36m kube-proxy
Normal Synced 36m cloud-node-controller Node synced successfully
Normal Starting 36m kubelet Starting kubelet.
Warning InvalidDiskCapacity 36m kubelet invalid capacity 0 on image filesystem
Normal NodeAllocatableEnforced 36m kubelet Updated Node Allocatable limit across pods
Normal RegisteredNode 36m node-controller Node 9a694765b4ca event: Registered Node 9a694765b4ca in Controller
Warning FreeDiskSpaceFailed 31m kubelet failed to garbage collect required amount of images. Wanted to free 104807682867 bytes, but freed 155692522 bytes
Warning ImageGCFailed 31m kubelet failed to garbage collect required amount of images. Wanted to free 104807682867 bytes, but freed 155692522 bytes
Warning ImageGCFailed 26m kubelet failed to garbage collect required amount of images. Wanted to free 104841306931 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 26m kubelet failed to garbage collect required amount of images. Wanted to free 104841306931 bytes, but freed 0 bytes
Warning ImageGCFailed 21m kubelet failed to garbage collect required amount of images. Wanted to free 106252362547 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 21m kubelet failed to garbage collect required amount of images. Wanted to free 106252362547 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 16m kubelet failed to garbage collect required amount of images. Wanted to free 110294586163 bytes, but freed 0 bytes
Warning ImageGCFailed 16m kubelet failed to garbage collect required amount of images. Wanted to free 110294586163 bytes, but freed 0 bytes
Warning ImageGCFailed 11m kubelet failed to garbage collect required amount of images. Wanted to free 112607851315 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 11m kubelet failed to garbage collect required amount of images. Wanted to free 112607851315 bytes, but freed 0 bytes
Warning FreeDiskSpaceFailed 6m33s kubelet failed to garbage collect required amount of images. Wanted to free 135589974835 bytes, but freed 0 bytes
Warning ImageGCFailed 6m33s kubelet failed to garbage collect required amount of images. Wanted to free 135589974835 bytes, but freed 0 bytes
Normal NodeNotReady 6m25s (x2 over 17m) node-controller Node 9a694765b4ca status is now: NodeNotReady
Normal NodeHasSufficientMemory 6m19s (x4 over 36m) kubelet Node 9a694765b4ca status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 6m19s (x4 over 36m) kubelet Node 9a694765b4ca status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 6m19s (x4 over 36m) kubelet Node 9a694765b4ca status is now: NodeHasSufficientPID
Normal NodeReady 6m19s (x3 over 36m) kubelet Node 9a694765b4ca status is now: NodeReady
Warning EvictionThresholdMet 3m52s kubelet Attempting to reclaim ephemeral-storage
Warning FreeDiskSpaceFailed 92s kubelet failed to garbage collect required amount of images. Wanted to free 142914827059 bytes, but freed 284035241 bytes
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 2:03 PM
manifests-gpu from the Makefile in build-gpu be the culprit? 🤔
freezing-boots-56761
freezing-boots-56761
freezing-boots-56761
freezing-boots-56761
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 2:08 PM
docker system prune
sometime back to see if help, it didn’t. Do you think this docker system prune -a —volumes
would be worth a shot?
freezing-boots-56761
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 2:10 PM
2 dashes before volumes
Ok. Sure, let me do that.
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 2:18 PM
❯ df -H
Filesystem Size Used Avail Use% Mounted on
udev 34G 0 34G 0% /dev
tmpfs 6.8G 2.3M 6.8G 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 968G 838G 81G 92% /
tmpfs 34G 52M 34G 1% /dev/shm
tmpfs 5.3M 4.1k 5.3M 1% /run/lock
tmpfs 34G 0 34G 0% /sys/fs/cgroup
/dev/loop2 59M 59M 0 100% /snap/core18/2785
/dev/loop5 97M 97M 0 100% /snap/lxd/24061
/dev/loop3 97M 97M 0 100% /snap/lxd/23991
/dev/loop1 67M 67M 0 100% /snap/core20/1950
/dev/loop6 56M 56M 0 100% /snap/snapd/19457
/dev/loop7 56M 56M 0 100% /snap/snapd/19361
/dev/loop8 337M 337M 0 100% /snap/vlc/3078
/dev/loop4 67M 67M 0 100% /snap/core20/1974
/dev/loop0 59M 59M 0 100% /snap/core18/2751
/dev/sda2 11G 434M 9.5G 5% /boot
/dev/sda1 5.4G 5.5M 5.4G 1% /boot/efi
tmpfs 6.8G 21k 6.8G 1% /run/user/1776609218
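One rough way to relate the kubelet events earlier in the thread to this df output is to back out the free space the kubelet saw from its "Wanted to free N bytes" messages. The sketch below assumes the kubelet's default image GC low threshold (80%) and default hard eviction threshold (nodefs.available < 10%); the k3s build inside the sandbox may be configured differently, and the function names are made up for illustration.

```python
# Sketch only: assumes kubelet defaults, which the sandbox's k3s may override.
IMAGE_GC_LOW_THRESHOLD_PCT = 80      # assumed default --image-gc-low-threshold
EVICTION_NODEFS_AVAILABLE_PCT = 10   # assumed default nodefs.available hard eviction threshold


def implied_available_bytes(capacity_bytes: int, wanted_to_free_bytes: int) -> int:
    """Invert amount_to_free = capacity * (100 - low) / 100 - available."""
    target_free = capacity_bytes * (100 - IMAGE_GC_LOW_THRESHOLD_PCT) // 100
    return target_free - wanted_to_free_bytes


def under_eviction_threshold(capacity_bytes: int, available_bytes: int) -> bool:
    """True if free space is below the assumed nodefs.available threshold."""
    return available_bytes * 100 < capacity_bytes * EVICTION_NODEFS_AVAILABLE_PCT


# Capacity from `kubectl describe nodes`: ephemeral-storage 944801904Ki.
capacity = 944801904 * 1024

# "Wanted to free" values copied from the kubelet events in this thread.
for wanted in (104807682867, 110758122291, 142914827059):
    available = implied_available_bytes(capacity, wanted)
    pct = 100 * available / capacity
    print(
        f"wanted={wanted} -> available~{available / 1e9:.1f} GB "
        f"({pct:.1f}% of capacity), "
        f"below eviction threshold: {under_eviction_threshold(capacity, available)}"
    )
```

Under those assumptions, each event implies that less than 10% of the node filesystem was free, which is in the same range as the 81G Avail that df reports for /. That would point at the host disk being nearly full rather than at anything specific to how the sandbox was started, though this is only an estimate.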
incalculable-ice-13425
08/08/2023, 2:19 PM
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 2:20 PM
incalculable-ice-13425
08/08/2023, 2:22 PM
freezing-boots-56761
freezing-boots-56761
incalculable-ice-13425
08/08/2023, 2:32 PM
docker prune
and 100G from deleting cache. It deleted the sandbox-gpu image as well. Let me build it again and try.
incalculable-ice-13425
08/08/2023, 2:45 PM
flytectl demo start --image <>, container exiting immediately.
incalculable-ice-13425
08/08/2023, 3:14 PM
incalculable-ice-13425
08/08/2023, 3:15 PM
make start
as well.
freezing-boots-56761
quick-salesclerk-18019
08/08/2023, 3:46 PM
incalculable-ice-13425
08/09/2023, 7:02 AM
flytectl demo start --image <> is working fine with a GPU-compatible sandbox image with these changes: https://github.com/flyteorg/flyte/pull/3256#issuecomment-1670780686
incalculable-ice-13425
08/10/2023, 8:25 AM
my_task = ContainerTask(
metadata=TaskMetadata(cache=True, cache_version="1.2"),
name="my_task",
image="my-task-image",
input_data_dir="/var/inputs",
output_data_dir="/var/outputs",
inputs=kwtypes(inDir=str),
outputs=kwtypes(out=str),
command=[
"/bin/bash",
],
arguments=[
"-c",
"echo \"out\" > /var/outputs/out; ... other commands",
],
)
As done above, I’m creating the out file in /var/outputs. This was working in Flyte version 1.8.1; however, in the master version, I’m getting the below error during task execution in the UI.
[1/1] currentAttempt done. Last Error: UNKNOWN::Outputs not generated by task execution
Note that if I comment out the line outputs=kwtypes(out=str), execution passes, but the task is not cached. I’m assuming some change between 1.8.1 and the latest version changed this behaviour.
incalculable-ice-13425
08/10/2023, 1:09 PM
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
incalculable-ice-13425
08/10/2023, 1:27 PM
arguments=[
"-c",
"echo \"out\" > /var/outputs/out; ... other commands",
],
incalculable-ice-13425
08/10/2023, 1:27 PM
If this is a regression we will fix it - can you downgrade and it works?
this works in 1.8.1
freezing-airport-6809
freezing-airport-6809
incalculable-ice-13425
08/10/2023, 1:29 PM
If I check the contents of /var/outputs/out, it is out. It was supposed to be "out"
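One thing worth checking here is what bash actually receives. In the Python source, \" is only an escape inside the string literal, so bash sees echo "out" and strips the quotes before echo runs; a file containing out without quotes is what that command would be expected to write. A minimal sketch using shlex from the standard library to approximate POSIX shell word splitting (it is not bash itself, and it does not interpret the > redirection):

```python
import shlex

# The argument exactly as written in the ContainerTask definition.
arg = "echo \"out\" > /var/outputs/out"

# Python has already consumed the backslashes; bash receives plain double quotes.
print(arg)               # echo "out" > /var/outputs/out

# POSIX-style splitting removes the quotes, as the shell would.
print(shlex.split(arg))  # ['echo', 'out', '>', '/var/outputs/out']

# To keep literal quotes in the file, they have to survive the shell as well,
# for example by wrapping them in single quotes.
quoted = "echo '\"out\"' > /var/outputs/out"
print(shlex.split(quoted))  # ['echo', '"out"', '>', '/var/outputs/out']
```

Whether the file needs to contain the quotes for Flyte to read it back as a str is a separate question that this sketch does not answer.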
freezing-airport-6809
incalculable-ice-13425
08/10/2023, 1:30 PM
"out"
during execution, it is getting cached. Not sure what leads to this behavior
incalculable-ice-13425
08/10/2023, 1:30 PM
If I check the contents of /var/outputs/out it is out. It was supposed to be "out"
This is the problem. It’s not putting "out", which it should I guess; it’s putting out
freezing-airport-6809
incalculable-ice-13425
08/10/2023, 1:31 PM
kubectl exec ..
) and then change the file content to "out", it worked in that case
freezing-airport-6809
freezing-airport-6809
incalculable-ice-13425
08/10/2023, 1:34 PM
incalculable-ice-13425
08/10/2023, 1:34 PM
this implies we are not casting string correctly
Yes, I believe
incalculable-ice-13425
08/10/2023, 3:44 PM
arguments=[
"-c",
"echo \"out\" > /var/outputs/out; ... other commands",
],
So, there is definitely a regression in caching for ContainerTask on the master branch.
high-accountant-32689
08/10/2023, 6:07 PM
from flytekit import ContainerTask, TaskMetadata, kwtypes
my_task = ContainerTask(
metadata=TaskMetadata(cache=True, cache_version="v1"),
name="my_task",
image="ubuntu:latest",
input_data_dir="/var/inputs",
output_data_dir="/var/outputs",
inputs=kwtypes(),
outputs=kwtypes(out=str),
command=[
"/bin/bash",
],
arguments=[
"-c",
"echo \"out\" > /var/outputs/out",
],
)
Also, minor I know, but I see the escaped double-quotes in the output using the ubuntu:latest image.
high-accountant-32689
08/10/2023, 6:08 PM
incalculable-ice-13425
08/10/2023, 6:16 PM
make build in docker/sandbox-bundled.
incalculable-ice-13425
08/10/2023, 6:23 PM
\
, that’s why it is showing echo "out", but it was echo \"out\" in the code
incalculable-ice-13425
08/10/2023, 6:27 PM
high-accountant-32689
08/10/2023, 6:54 PM
incalculable-ice-13425
08/10/2023, 7:05 PM
"echo \"out\" > /var/outputs/out; ... ; my-bin --arguments '{\"arg_1\": \"\\\"1\\\"\", \"arg_2\": \"\\\"2\\\"\"}' "
. Not sure if that plays a role.
high-accountant-32689
08/10/2023, 9:50 PM
incalculable-ice-13425
08/11/2023, 5:43 AM
manifests-gpu from the build-gpu target, as it was failing with a helm parsing issue for me if I trigger make build-gpu with that. Removing manifests-gpu worked for me.
diff --git a/docker/sandbox-bundled/Dockerfile.gpu b/docker/sandbox-bundled/Dockerfile.gpu
index 93d8afbb6..6b5ec90b6 100644
--- a/docker/sandbox-bundled/Dockerfile.gpu
+++ b/docker/sandbox-bundled/Dockerfile.gpu
@@ -26,7 +26,7 @@ RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/r
# syntax=docker/dockerfile:1.4-labs
#Following
-FROM nvidia/cuda:11.4.3-base-ubuntu20.04
+FROM nvidia/cuda:11.4.0-base-ubuntu20.04
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
@@ -76,9 +76,6 @@ VOLUME /var/lib/rancher/k3s
VOLUME /var/lib/cni
VOLUME /var/log
-ENV NVIDIA_VISIBLE_DEVICES="all"
-ENV NVIDIA_DRIVER_CAPABILITIES="all"
-RUN nvidia-ctk runtime configure --runtime=docker --set-as-default
ENTRYPOINT [ "/bin/k3d-entrypoint.sh" ]
-CMD [ "server", "--disable=traefik", "--disable=servicelb", "--kubelet-arg=allowed-unsafe-sysctls=fs.mqueue.*" ]
+CMD [ "server", "--disable=traefik", "--disable=servicelb" ]
diff --git a/docker/sandbox-bundled/Makefile b/docker/sandbox-bundled/Makefile
index 9eb9970f0..e0e32e530 100644
--- a/docker/sandbox-bundled/Makefile
+++ b/docker/sandbox-bundled/Makefile
@@ -44,7 +44,7 @@ build: flyte manifests
--tag flyte-sandbox:latest .
.PHONY: build-gpu
-build-gpu: flyte
+build-gpu: flyte manifests-gpu
[ -n "$(shell docker buildx ls | awk '/^flyte-sandbox / {print $$1}')" ] || \
docker buildx create --name flyte-sandbox \
--driver docker-container --driver-opt image=moby/buildkit:master \
incalculable-ice-13425
08/11/2023, 5:45 AM
incalculable-ice-13425
08/11/2023, 5:46 AM
freezing-boots-56761