# ask-the-community
g
Hi, I have a ContainerTask as shown below
Copy code
my_task = ContainerTask(
    metadata=TaskMetadata(cache=True, cache_version="1.0"),
    name="my-task",
    image="my-image",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inDir=str),
    outputs=kwtypes(out=str),
    requests=Resources(gpu="1"),
    limits=Resources(gpu="1"),
    command=[
        "/bin/bash",
    ],
    arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out; ... other commands"
        ],
   ....
)
I wanted to cache the task, for which I found that I had to declare inputs/outputs even though I don’t need them. So I just write the string “out” to /var/outputs/out (as shown in the arguments) and pass a string for inDir when calling the task, as below.
Copy code
@workflow
def aeb_sanity_workflow(data: Dict):
    ## -----------------------------------------------------------------------------
    .......
    my_task_promise = my_task(inDir="some string")
    ........
This was working for me with the earlier version of Flyte mentioned below
cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a
However, I use the master version of Flyte patched with https://github.com/flyteorg/flyte/pull/3256, manually built in docker/sandbox-bundled using make build-gpu, because I needed GPU support in the sandbox. With this latest version, I’m seeing two issues that were not there with
cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a
tag: v1.8.1
1. For the above-mentioned ContainerTask, it’s throwing errors saying the output doesn’t exist after workflow execution. I haven’t changed a single line of the task code; only the Flyte image is newer.
2. Also, for the task that needs a GPU, since the image size is huge (~24 GB), the k8s node came under disk pressure and several pods were evicted.
Copy code
> kubectl describe pod <gpu-pod>
  Warning  Evicted              8m10s (x3 over 9m30s)  kubelet            The node was low on resource: ephemeral-storage.
  Warning  ExceededGracePeriod  8m (x3 over 9m20s)     kubelet            Container runtime did not kill the pod within specified grace period.
  Normal   Pulled               7m59s                  kubelet            Successfully pulled image "my-gpu-image" in 8m40.26240502s
  Normal   Created              7m59s                  kubelet            Created container primary
  Normal   Started              7m58s                  kubelet            Started container primary
  Normal   Killing              7m58s                  kubelet            Stopping container primary
  Warning  Evicted              7m30s                  kubelet            The node was low on resource: ephemeral-storage. Container primary was using 13516Ki, which exceeds its request of 0.
Copy code
> kubectl describe nodes <>
Warning  FreeDiskSpaceFailed      52m                    kubelet                failed to garbage collect required amount of images. Wanted to free 110758122291 bytes, but freed 155692522 bytes
  Warning  ImageGCFailed            52m                    kubelet                failed to garbage collect required amount of images. Wanted to free 110758122291 bytes, but freed 155692522 bytes
  Warning  FreeDiskSpaceFailed      47m                    kubelet                failed to garbage collect required amount of images. Wanted to free 111138763571 bytes, but freed 0 bytes
  Warning  ImageGCFailed            47m                    kubelet                failed to garbage collect required amount of images. Wanted to free 111138763571 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet     7m56s (x3 over 11m)    kubelet                Attempting to reclaim ephemeral-storage
  Normal   NodeNotReady             7m49s                  node-controller        Node 1fefe346c083 status is now: NodeNotReady
  Normal   NodeHasSufficientMemory  7m47s (x3 over 57m)    kubelet                Node 1fefe346c083 status is now: NodeHasSufficientMemory
  Normal   NodeHasDiskPressure      7m47s (x2 over 11m)    kubelet                Node 1fefe346c083 status is now: NodeHasDiskPressure
I didn’t observe these issues in this image
cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a
tag: v1.8.1
tagging @jeev since you are aware of the context.
error for the first issue
error for the second issue
j
How big of a disk are you using? If on a mac, what is the allocation?
The nvidia gpu image is large
g
I think the issue is the way we are starting the cluster. For the one with the latest image patched for GPUs, I started the cluster by doing this
Copy code
1. make build-gpu
2. make start
I couldn’t use flytectl demo start --image <> because it’s exiting immediately. This is because --gpus all was required in docker run, which I manually added in the Makefile before calling make start. This is what kubectl describe node <> shows in this case
Copy code
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                       ------------  ----------  ---------------  -------------  ---
  kube-system                 nvidia-device-plugin-daemonset-45mbf       0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  kube-system                 metrics-server-667586758d-8fflg            100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         37m
  kube-system                 coredns-7b5bbc6644-hsk6n                   100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     37m
  kube-system                 local-path-provisioner-687d6d7765-dsgmm    0 (0%)        0 (0%)      0 (0%)           0 (0%)         37m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (2%)   0 (0%)
  memory             140Mi (0%)  170Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
However, in the case of the normal sandbox cluster started with flytectl demo start, below was the node capacity
Copy code
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                   ------------  ----------  ---------------  -------------  ---
  kube-system                 local-path-provisioner-7b7dc8d6f5-lb2rx                0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
  flyte                       flyte-sandbox-kubernetes-dashboard-6757db879c-vfvnw    100m (1%)     2 (25%)     200Mi (0%)       200Mi (0%)     4m57s
  kube-system                 coredns-b96499967-h5np6                                100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     4m57s
  kube-system                 metrics-server-668d979685-5dvtl                        100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         4m57s
  flyte                       flyte-sandbox-docker-registry-6494d7666-mpq9t          0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
  flyte                       flyte-sandbox-proxy-d95874857-v5vrh                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
  flyte                       flyte-sandbox-postgresql-0                             250m (3%)     0 (0%)      256Mi (0%)       0 (0%)         4m57s
  flyte                       flyte-sandbox-minio-645c8ddf7c-h5wbn                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
  flyte                       flyte-sandbox-69c7f848db-g9psq                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                550m (6%)   2 (25%)
  memory             596Mi (0%)  370Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
I can see that there is a lot of difference in the node capacities for the two cases, which is causing the first one to crash because of the disk issue.
Please open the thread in a new window for better visibility, apologies. Is there any way I can start the cluster for the first method with GPU support that adds the necessary capacities, as in the normal case, along with adding --gpus all to docker run?
j
iirc, in the PR, they were able to get it working by modifying the docker config only, as opposed to setting the gpus flag. if so, flytectl demo start should work. did you try that?
g
No. flytectl demo start --image <> is not working. I have the Docker config with the NVIDIA runtime only. We need to add --gpus all to docker run, otherwise the sandbox will not have access to the GPU, and that’s why it’s crashing immediately. Another user in the patch comments also faced the same issue.
For --gpus all to work, we must have the Docker config with the NVIDIA runtime, so that part is correct. The only missing part is calling docker run with --gpus all from the flytectl demo start --image <> command.
j
hmm i see. ok. w.r.t. the capacities, it looks like the difference is just due to a missing Flyte namespace.
is that because the Flyte pods are crashlooping?
g
hmm i see. ok. w.r.t. the capacities, it looks like the difference is just due to a missing Flyte namespace.
Look at the Allocated resources: section; even that is lower in the first case (make start)
j
Yes, because there are fewer pods. You want to look at Allocatable capacity, a bit higher in the output
g
is that because the Flyte pods are crashlooping?
It’s happening at the stage when the pod pulls the ~24 GB image, when the node starts falling short of ephemeral storage. It’s working fine in the normal case (flytectl demo start). Tried multiple times
j
what machine is this running on?
how big is your disk rather?
g
Linux. 1 TB disk, 64 GB RAM
Is there a way we can make the node capacities the same when starting with make start, or pass the --gpus all flag when calling flytectl demo start --image <>? I think either of them would solve the issue.
j
The node capacities should be the same I’d think. Can you paste the full output of “kubectl describe nodes” in both scenarios?
Passing the gpus flag will require a change to flytectl. GPU support hasn’t been planned yet afaik.
I don’t see why this won’t work with a 1TB disk.
g
https://github.com/flyteorg/flyte/pull/3256#issuecomment-1472279870 doesn’t work?
Yes, tried multiple times; it doesn’t work. I’ve patched his changes onto the master repo, then did make build-gpu and make start. Note that I had to remove manifest-gpu from build-gpu for make build-gpu to produce a Docker image.
Here’s the one for normal case
Copy code
kubectl describe nodes 8bb90a7d22d8
Name:               8bb90a7d22d8
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=8bb90a7d22d8
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"e6:0c:35:ea:3b:5b"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 172.17.0.2
                    k3s.io/hostname: 8bb90a7d22d8
                    k3s.io/internal-ip: 172.17.0.2
                    k3s.io/node-args: ["server","--disable","traefik","--disable","servicelb"]
                    k3s.io/node-config-hash: DTM2Y77ISYLRTGPA5HIDT5VGTAXMZ5BQ5HF4OYULEA4KHR2EII4A====
                    k3s.io/node-env: {"K3S_KUBECONFIG_OUTPUT":"/var/lib/flyte/config/kubeconfig"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 08 Aug 2023 19:22:27 +0530
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  8bb90a7d22d8
  AcquireTime:     <unset>
  RenewTime:       Tue, 08 Aug 2023 19:24:45 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 08 Aug 2023 19:22:58 +0530   Tue, 08 Aug 2023 19:22:26 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 08 Aug 2023 19:22:58 +0530   Tue, 08 Aug 2023 19:22:26 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 08 Aug 2023 19:22:58 +0530   Tue, 08 Aug 2023 19:22:26 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 08 Aug 2023 19:22:58 +0530   Tue, 08 Aug 2023 19:22:37 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.17.0.2
  Hostname:    8bb90a7d22d8
Capacity:
  cpu:                8
  ephemeral-storage:  944801904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  919103291491
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
System Info:
  Machine ID:
  System UUID:                324ab640-d7da-11dd-b4b7-b06ebfc7723f
  Boot ID:                    fe7b217c-8f61-4568-93b0-a897170f1db9
  Kernel Version:             5.15.0-78-generic
  OS Image:                   K3s dev
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.6-k3s1
  Kubelet Version:            v1.24.4+k3s1
  Kube-Proxy Version:         v1.24.4+k3s1
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
ProviderID:                   k3s://8bb90a7d22d8
Non-terminated Pods:          (10 in total)
  Namespace                   Name                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                   ------------  ----------  ---------------  -------------  ---
  kube-system                 coredns-b96499967-8lz84                                100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     3m30s
  kube-system                 local-path-provisioner-7b7dc8d6f5-q2qtw                0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-kubernetes-dashboard-6757db879c-txvl5    100m (1%)     2 (25%)     200Mi (0%)       200Mi (0%)     3m30s
  kube-system                 metrics-server-668d979685-7xl5k                        100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         3m30s
  flyte                       flyte-sandbox-docker-registry-7ddfcc58ff-zvlvj         0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-proxy-d95874857-4g2p7                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-postgresql-0                             250m (3%)     0 (0%)      256Mi (0%)       0 (0%)         3m30s
  flyte                       flyte-sandbox-minio-645c8ddf7c-wkgdk                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-buildkit-7d7d55dbb-kh949                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-98749fb56-bsvsw                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                550m (6%)   2 (25%)
  memory             596Mi (0%)  370Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   Starting                 2m22s                  kube-proxy
  Normal   Starting                 2m24s                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      2m24s                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  2m24s (x2 over 2m24s)  kubelet                Node 8bb90a7d22d8 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    2m24s (x2 over 2m24s)  kubelet                Node 8bb90a7d22d8 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     2m24s (x2 over 2m24s)  kubelet                Node 8bb90a7d22d8 status is now: NodeHasSufficientPID
  Normal   Synced                   2m23s                  cloud-node-controller  Node synced successfully
  Normal   NodeAllocatableEnforced  2m23s                  kubelet                Updated Node Allocatable limit across pods
  Normal   RegisteredNode           2m20s                  node-controller        Node 8bb90a7d22d8 event: Registered Node 8bb90a7d22d8 in Controller
  Normal   NodeReady                2m13s                  kubelet                Node 8bb90a7d22d8 status is now: NodeReady
Here’s the one for make start
Copy code
kubectl describe nodes 9a694765b4ca
Name:               9a694765b4ca
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=9a694765b4ca
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"22:24:c9:c3:a8:f8"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 172.17.0.2
                    k3s.io/hostname: 9a694765b4ca
                    k3s.io/internal-ip: 172.17.0.2
                    k3s.io/node-args: ["server","--disable","traefik","--disable","servicelb"]
                    k3s.io/node-config-hash: 4J5AJPVISIGUNB3TJQ23D56IPSX5PBUDZLHDT7KGRNZTZCZG2KXA====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/cc73cc54f96e096349faa656c251469ebefa4f9ab0d3f356ea6895ff145dcd1e","K3S_KUBECONFIG_OUTPUT":"/.ku...
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 08 Aug 2023 17:59:08 +0530
Taints:             node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  9a694765b4ca
  AcquireTime:     <unset>
  RenewTime:       Tue, 08 Aug 2023 18:35:39 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 08 Aug 2023 18:35:11 +0530   Tue, 08 Aug 2023 18:29:23 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     True    Tue, 08 Aug 2023 18:35:11 +0530   Tue, 08 Aug 2023 18:31:57 +0530   KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure      False   Tue, 08 Aug 2023 18:35:11 +0530   Tue, 08 Aug 2023 18:29:23 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 08 Aug 2023 18:35:11 +0530   Tue, 08 Aug 2023 18:29:23 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.17.0.2
  Hostname:    9a694765b4ca
Capacity:
  cpu:                8
  ephemeral-storage:  944801904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  919103291491
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:
  System UUID:                324ab640-d7da-11dd-b4b7-b06ebfc7723f
  Boot ID:                    fe7b217c-8f61-4568-93b0-a897170f1db9
  Kernel Version:             5.15.0-78-generic
  OS Image:                   Ubuntu 20.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.12-k3s1
  Kubelet Version:            v1.24.9+k3s1
  Kube-Proxy Version:         v1.24.9+k3s1
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
ProviderID:                   k3s://9a694765b4ca
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                       ------------  ----------  ---------------  -------------  ---
  kube-system                 nvidia-device-plugin-daemonset-45mbf       0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  kube-system                 metrics-server-667586758d-8fflg            100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         37m
  kube-system                 coredns-7b5bbc6644-hsk6n                   100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     37m
  kube-system                 local-path-provisioner-687d6d7765-dsgmm    0 (0%)        0 (0%)      0 (0%)           0 (0%)         37m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (2%)   0 (0%)
  memory             140Mi (0%)  170Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:
  Type     Reason                   Age                  From                   Message
  ----     ------                   ----                 ----                   -------
  Normal   Starting                 36m                  kube-proxy
  Normal   Synced                   36m                  cloud-node-controller  Node synced successfully
  Normal   Starting                 36m                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      36m                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeAllocatableEnforced  36m                  kubelet                Updated Node Allocatable limit across pods
  Normal   RegisteredNode           36m                  node-controller        Node 9a694765b4ca event: Registered Node 9a694765b4ca in Controller
  Warning  FreeDiskSpaceFailed      31m                  kubelet                failed to garbage collect required amount of images. Wanted to free 104807682867 bytes, but freed 155692522 bytes
  Warning  ImageGCFailed            31m                  kubelet                failed to garbage collect required amount of images. Wanted to free 104807682867 bytes, but freed 155692522 bytes
  Warning  ImageGCFailed            26m                  kubelet                failed to garbage collect required amount of images. Wanted to free 104841306931 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      26m                  kubelet                failed to garbage collect required amount of images. Wanted to free 104841306931 bytes, but freed 0 bytes
  Warning  ImageGCFailed            21m                  kubelet                failed to garbage collect required amount of images. Wanted to free 106252362547 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      21m                  kubelet                failed to garbage collect required amount of images. Wanted to free 106252362547 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      16m                  kubelet                failed to garbage collect required amount of images. Wanted to free 110294586163 bytes, but freed 0 bytes
  Warning  ImageGCFailed            16m                  kubelet                failed to garbage collect required amount of images. Wanted to free 110294586163 bytes, but freed 0 bytes
  Warning  ImageGCFailed            11m                  kubelet                failed to garbage collect required amount of images. Wanted to free 112607851315 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      11m                  kubelet                failed to garbage collect required amount of images. Wanted to free 112607851315 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      6m33s                kubelet                failed to garbage collect required amount of images. Wanted to free 135589974835 bytes, but freed 0 bytes
  Warning  ImageGCFailed            6m33s                kubelet                failed to garbage collect required amount of images. Wanted to free 135589974835 bytes, but freed 0 bytes
  Normal   NodeNotReady             6m25s (x2 over 17m)  node-controller        Node 9a694765b4ca status is now: NodeNotReady
  Normal   NodeHasSufficientMemory  6m19s (x4 over 36m)  kubelet                Node 9a694765b4ca status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    6m19s (x4 over 36m)  kubelet                Node 9a694765b4ca status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     6m19s (x4 over 36m)  kubelet                Node 9a694765b4ca status is now: NodeHasSufficientPID
  Normal   NodeReady                6m19s (x3 over 36m)  kubelet                Node 9a694765b4ca status is now: NodeReady
  Warning  EvictionThresholdMet     3m52s                kubelet                Attempting to reclaim ephemeral-storage
  Warning  FreeDiskSpaceFailed      92s                  kubelet                failed to garbage collect required amount of images. Wanted to free 142914827059 bytes, but freed 284035241 bytes
j
hmm I see, ok.
g
Could removing manifest-gpu from build-gpu in the Makefile be the culprit? 🤔
j
Allocatable ephemeral-storage looks to be the same in both cases.
Try this: “docker system prune -a —volumes”
that should be 2 dashes. i’m on my phone :(
also try deleting any unused images in “docker images”
then prune again
g
I only did docker system prune some time back, to see if it helps; it didn’t. Do you think docker system prune -a —volumes would be worth a shot?
j
yes
2 dashes before volumes
g
2 dashes before volumes
Ok. Sure, let me do that.
j
you can also check your disk usage with df -h
g
Copy code
❯ df -H
Filesystem                         Size  Used Avail Use% Mounted on
udev                                34G     0   34G   0% /dev
tmpfs                              6.8G  2.3M  6.8G   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv  968G  838G   81G  92% /
tmpfs                               34G   52M   34G   1% /dev/shm
tmpfs                              5.3M  4.1k  5.3M   1% /run/lock
tmpfs                               34G     0   34G   0% /sys/fs/cgroup
/dev/loop2                          59M   59M     0 100% /snap/core18/2785
/dev/loop5                          97M   97M     0 100% /snap/lxd/24061
/dev/loop3                          97M   97M     0 100% /snap/lxd/23991
/dev/loop1                          67M   67M     0 100% /snap/core20/1950
/dev/loop6                          56M   56M     0 100% /snap/snapd/19457
/dev/loop7                          56M   56M     0 100% /snap/snapd/19361
/dev/loop8                         337M  337M     0 100% /snap/vlc/3078
/dev/loop4                          67M   67M     0 100% /snap/core20/1974
/dev/loop0                          59M   59M     0 100% /snap/core18/2751
/dev/sda2                           11G  434M  9.5G   5% /boot
/dev/sda1                          5.4G  5.5M  5.4G   1% /boot/efi
tmpfs                              6.8G   21k  6.8G   1% /run/user/1776609218
Looks pretty full. 92%
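As a side note on reading these numbers: df computes Use% against used + available (root-reserved blocks are excluded), which is why the reported 92% is higher than used/total. A small stdlib sketch, with pct_used as a hypothetical helper name:

```python
import shutil

def pct_used(used_gb: float, avail_gb: float) -> float:
    # df-style percentage: used / (used + available), ignoring reserved blocks.
    return round(100 * used_gb / (used_gb + avail_gb), 1)

# The df output above shows 838G used and 81G available on /:
print(pct_used(838, 81))  # -> 91.2, close to the 92% df reports (df rounds up)

# The same figure can be read live with the standard library:
usage = shutil.disk_usage("/")
print(f"{usage.used / 1e9:.0f} GB used of {usage.total / 1e9:.0f} GB total")
```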
j
That might explain it. You are effectively working with <100G. did the prune free up space?
g
I was just taking a break. Just triggered it
Anyway, for the first issue with the ContainerTask, there is a regression, as it was working with 1.8.1
j
weird things happen under disk pressure unfortunately.
Also the branch is significantly behind 1.8.1
g
well, claimed 100G from
docker prune
and 100G from deleting
cache
. It deleted the sandbox-gpu image as well. Let me build it again and try
FYI for the case, if I do
flytectl demo start --image <>
, the container was exiting immediately.
That worked!!!! Damn disk usage 😓 Thank you so much @jeev 🙌
Now, the node spec is the same for both the normal start and
make start
.
j
👍
b
Great! 🙂
g
Update:
flytectl demo start --image <>
is working fine with GPU compatible sandbox image with these changes https://github.com/flyteorg/flyte/pull/3256#issuecomment-1670780686
I’m still facing the 1st issue, with the ContainerTask execution. I wanted this task to be cached, so I defined the signature as below, as recommended by @Ketan (kumare3) earlier
Copy code
my_task = ContainerTask(
    metadata=TaskMetadata(cache=True, cache_version="1.2"),
    name="my_task",
    image="my-task-image",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inDir=str),
    outputs=kwtypes(out=str),
    command=[
        "/bin/bash",
    ],
    arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out; ... other commands",
    ],
)
As done above, I’m creating
out
file in
/var/outputs
. This was working in Flyte version 1.8.1; however, with the master version, I’m getting the below error during task execution in the UI.
Copy code
[1/1] currentAttempt done. Last Error: UNKNOWN::Outputs not generated by task execution
Note that if I comment out the line
outputs=kwtypes(out=str)
, execution passes, but the task is not cached. I’m assuming some change between 1.8.1 and the latest version altered this behaviour.
Hi @Ketan (kumare3) @jeev Could you please let me know why the above config for ContainerTask is not working in the latest Flyte version? I need this task cached, as it’s quite time-consuming to run every time.
k
Outputs are not cached because they were not generated
You don’t have it in the signature, so it will be disabled
If this is a regression we will fix it - can you downgrade and confirm it works?
g
I’m generating them here
Copy code
arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out; ... other commands",
    ],
If this is a regression we will fix it - can you downgrade and confirm it works?
this works in 1.8.1
k
Can you use 1.8.1?
Cc @Eduardo Apolinario (eapolinario) regression? Weird one
g
If I check the contents of
/var/outputs/out
it is
out
. It was supposed to be
"out"
k
And you are saying caching in 1.9.0 for container tasks does not work, right? AFK, will try later
g
If I manually change it to
"out"
during execution, it is getting cached. Not sure what leads to this behavior
If I check the contents of
/var/outputs/out
it is
out
. It was supposed to be
"out"
This is the problem. It’s not putting
"out"
which it should I guess, it’s putting
out
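For what it’s worth, if the arguments really are run through `bash -c`, this is standard shell quote removal: Python’s `\"` is just a literal double quote, and bash then strips unescaped quotes during word splitting, so the file would contain `out` with no quotes. A minimal sketch of that behavior (a temp file stands in for `/var/outputs/out`):

```python
import os
import subprocess
import tempfile

def run_echo_task() -> str:
    """Run the same `bash -c 'echo "out" > <file>'` command the task uses."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out")
        # In Python source, "echo \"out\"" is the string: echo "out"
        subprocess.run(["/bin/bash", "-c", f'echo "out" > {path}'], check=True)
        with open(path) as f:
            return f.read()

print(repr(run_echo_task()))  # 'out\n' -- bash removed the quotes
```

So a version in which the quotes survive into the file would suggest the command string is being processed differently there.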
k
Wdym manually change to out
g
During execution, I went to the pod (
kubectl exec ..
) and then change the file content to
"out"
, it worked in that case
k
Like you added quotes? Hmm, this implies we are not casting the string correctly. But if you do not add the quotes, what outputs do you see in the UI?
Ok, give us a day; it’s 6:00 am our time
g
image.png
this implies we are not casting string correctly
Yes, I believe
@Ketan (kumare3) I downgraded to 1.8.1, it worked with the below code as expected.
Copy code
arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out; ... other commands",
    ],
So, there is definitely a regression on caching for ContainerTask on master branch.
e
@Gaurav Kumar, I can't repro this on a sandbox built off of master (specifically this sandbox image: ghcr.io/flyteorg/flyte-sandbox-bundled:sha-dfb56f4639a57d519d8fc48cae7d192a385fc160). I ran the following example:
Copy code
from flytekit import ContainerTask, TaskMetadata, kwtypes

my_task = ContainerTask(
    metadata=TaskMetadata(cache=True, cache_version="v1"),
    name="my_task",
    image="ubuntu:latest",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(),
    outputs=kwtypes(out=str),
    command=[
        "/bin/bash",
    ],
    arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out",
    ],
)
Also, minor I know, but I see the escaped double-quotes in the output using the
ubuntu:latest
image.
How are you building this sandbox image in your case?
g
I’ve manually built the image from the master repo using
make build
in
docker/sandbox-bundled
.
Have a look at this screenshot for your reference for the non-working case, with the image built from the master repo. You can see the inputs on the right. The UI escapes the
\
, that’s why it’s showing
echo "out"
, but it was
echo \"out\"
in the code
Here’s the screenshot of the working scenario, after I downgraded to 1.8.1; the inputs are exactly the same
e
very interesting. I can't repro this yet (also built an image off of master locally). Mind expanding the command you're running? Is there any chance that the "out" file is being removed?
g
No, out is not removed anywhere in the cmd. The same cmd is being passed in both versions. One point to mention is that one of the commands takes a raw string as input, something like this
"echo \"out\" > /var/outputs/out; ... ; my-bin --arguments '{\"arg_1\": \"\\\"1\\\"\", \"arg_2\": \"\\\"2\\\"\"}' "
. Not sure if that plays a role.
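On the quoting layers in that fragment (the elided `...` parts omitted): after Python resolves `\"` and `\\\"`, bash sees a single-quoted JSON blob and passes it through verbatim, since bash leaves everything inside single quotes untouched. A sketch that checks what the inner binary would actually receive (the values `1` and `2` are from the snippet; `printf` stands in for the program taking the argument):

```python
import json
import subprocess

# The JSON fragment from the chat. After Python escape processing, bash
# receives: '{"arg_1": "\"1\"", "arg_2": "\"2\""}'
fragment = "'{\"arg_1\": \"\\\"1\\\"\", \"arg_2\": \"\\\"2\\\"\"}'"
# Bash strips only the outer single quotes; the backslashes and inner
# double quotes reach the program untouched as a single argument.
received = subprocess.run(
    ["/bin/bash", "-c", "printf '%s' " + fragment],
    capture_output=True, text=True, check=True,
).stdout
parsed = json.loads(received)
print(parsed)  # {'arg_1': '"1"', 'arg_2': '"2"'}
```

So the JSON values themselves end up containing literal double quotes, which is presumably the intent, but it does mean three layers of escaping (Python, bash, JSON) have to line up.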
e
@Gaurav Kumar, I'm having a hard time reproing this error. Can you push your local changes to https://github.com/flyteorg/flyte/pull/3256 ? Also, just to rule out any weirdness with the echo binary, can you talk a bit about the image you're using as the base image?
g
@Eduardo Apolinario (eapolinario) I’m using the image built on top of https://github.com/flyteorg/flyte/pull/3256 only. Below are the only extra changes I had to make to get it working for my task execution, which needs a GPU: 1. Change the NVIDIA/cuda base image for compatibility with my drivers. 2. Add env variables related to the NVIDIA drivers for task execution. 3. Remove
manifests-gpu
from
build-gpu
target, as it was failing with a helm parsing issue for me if I triggered
make build-gpu
with that. Removing
manifest-gpu
worked for me.
Copy code
diff --git a/docker/sandbox-bundled/Dockerfile.gpu b/docker/sandbox-bundled/Dockerfile.gpu
index 93d8afbb6..6b5ec90b6 100644
--- a/docker/sandbox-bundled/Dockerfile.gpu
+++ b/docker/sandbox-bundled/Dockerfile.gpu
@@ -26,7 +26,7 @@ RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/r
 # syntax=docker/dockerfile:1.4-labs

 #Following
-FROM nvidia/cuda:11.4.3-base-ubuntu20.04
+FROM nvidia/cuda:11.4.0-base-ubuntu20.04

 RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

@@ -76,9 +76,6 @@ VOLUME /var/lib/rancher/k3s
 VOLUME /var/lib/cni
 VOLUME /var/log

-ENV NVIDIA_VISIBLE_DEVICES="all"
-ENV NVIDIA_DRIVER_CAPABILITIES="all"
-RUN nvidia-ctk runtime configure --runtime=docker --set-as-default

 ENTRYPOINT [ "/bin/k3d-entrypoint.sh" ]
-CMD [ "server", "--disable=traefik", "--disable=servicelb", "--kubelet-arg=allowed-unsafe-sysctls=fs.mqueue.*" ]
+CMD [ "server", "--disable=traefik", "--disable=servicelb" ]
diff --git a/docker/sandbox-bundled/Makefile b/docker/sandbox-bundled/Makefile
index 9eb9970f0..e0e32e530 100644
--- a/docker/sandbox-bundled/Makefile
+++ b/docker/sandbox-bundled/Makefile
@@ -44,7 +44,7 @@ build: flyte manifests
                --tag flyte-sandbox:latest .

 .PHONY: build-gpu
-build-gpu: flyte
+build-gpu: flyte manifests-gpu
        [ -n "$(shell docker buildx ls | awk '/^flyte-sandbox / {print $$1}')" ] || \
                 docker buildx create --name flyte-sandbox \
                 --driver docker-container --driver-opt image=moby/buildkit:master \
So, I applied https://github.com/flyteorg/flyte/pull/3256 on top of 1.8.1 and on master. It’s working on 1.8.1 but not on master. It has nothing to do with the binary image. I’m seeing this with different tasks as well.
This is pure magic; even I’m a bit confused. 🤔
j
@Gaurav kumar what version of copilot is it running? It should be a sidecar on the pod.