# ask-the-community
g
Hi, I have a ContainerTask as shown below
Copy code
my_task = ContainerTask(
    metadata=TaskMetadata(cache=True, cache_version="1.0"),
    name="my-task",
    image="my-image",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inDir=str),
    outputs=kwtypes(out=str),
    requests=Resources(gpu="1"),
    limits=Resources(gpu="1"),
    command=[
        "/bin/bash",
    ],
    arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out; ... other commands"
        ],
   ....
)
I wanted to cache the task, for which I found that I had to declare inputs/outputs even though I don’t need them. So I just write the string “out” to /var/outputs/out (as shown in the arguments) and pass a string for inDir when calling the task, as below.
Copy code
@workflow
def aeb_sanity_workflow(data: Dict):
    ## -----------------------------------------------------------------------------
    .......
    my_task_promise = my_task(inDir="some string")
    ........
This was working for me with the earlier version of Flyte mentioned below
cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a
However, I use the master version of Flyte patched with https://github.com/flyteorg/flyte/pull/3256, manually built in docker/sandbox-bundled using make build-gpu, because I needed GPU support in the sandbox. With this latest version, I’m seeing two issues that were not there with
cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a
tag: v1.8.1
1. For the above-mentioned ContainerTask, it’s throwing errors saying the output doesn’t exist after workflow execution. I haven’t changed a single line of the task code; only the Flyte image is newer.
2. Also, for the task that needs a GPU, since the image size is huge (~24 GB), the k8s node came under disk pressure and several pods were evicted.
Copy code
> kubectl describe pod <gpu-pod>
  Warning  Evicted              8m10s (x3 over 9m30s)  kubelet            The node was low on resource: ephemeral-storage.
  Warning  ExceededGracePeriod  8m (x3 over 9m20s)     kubelet            Container runtime did not kill the pod within specified grace period.
  Normal   Pulled               7m59s                  kubelet            Successfully pulled image "my-gpu-image" in 8m40.26240502s
  Normal   Created              7m59s                  kubelet            Created container primary
  Normal   Started              7m58s                  kubelet            Started container primary
  Normal   Killing              7m58s                  kubelet            Stopping container primary
  Warning  Evicted              7m30s                  kubelet            The node was low on resource: ephemeral-storage. Container primary was using 13516Ki, which exceeds its request of 0.
Copy code
> kubectl describe nodes <>
Warning  FreeDiskSpaceFailed      52m                    kubelet                failed to garbage collect required amount of images. Wanted to free 110758122291 bytes, but freed 155692522 bytes
  Warning  ImageGCFailed            52m                    kubelet                failed to garbage collect required amount of images. Wanted to free 110758122291 bytes, but freed 155692522 bytes
  Warning  FreeDiskSpaceFailed      47m                    kubelet                failed to garbage collect required amount of images. Wanted to free 111138763571 bytes, but freed 0 bytes
  Warning  ImageGCFailed            47m                    kubelet                failed to garbage collect required amount of images. Wanted to free 111138763571 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet     7m56s (x3 over 11m)    kubelet                Attempting to reclaim ephemeral-storage
  Normal   NodeNotReady             7m49s                  node-controller        Node 1fefe346c083 status is now: NodeNotReady
  Normal   NodeHasSufficientMemory  7m47s (x3 over 57m)    kubelet                Node 1fefe346c083 status is now: NodeHasSufficientMemory
  Normal   NodeHasDiskPressure      7m47s (x2 over 11m)    kubelet                Node 1fefe346c083 status is now: NodeHasDiskPressure
I didn’t observe these issues in this image
cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a
tag: v1.8.1
tagging @jeev since you are aware of the context.
error for the first issue
error for the second issue
j
How big of a disk are you using? If on a mac, what is the allocation?
The nvidia gpu image is large
g
I think the issue is the way we are starting the cluster. For the one with the latest image patched for GPUs, I started the cluster by doing this
Copy code
1. make build-gpu
2. make start
I couldn’t use flytectl demo start --image <> because it’s exiting immediately. This is because --gpus all was required in docker run, which I manually added in the Makefile before calling make start. This is what kubectl describe node <> shows in this case
Copy code
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                       ------------  ----------  ---------------  -------------  ---
  kube-system                 nvidia-device-plugin-daemonset-45mbf       0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  kube-system                 metrics-server-667586758d-8fflg            100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         37m
  kube-system                 coredns-7b5bbc6644-hsk6n                   100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     37m
  kube-system                 local-path-provisioner-687d6d7765-dsgmm    0 (0%)        0 (0%)      0 (0%)           0 (0%)         37m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (2%)   0 (0%)
  memory             140Mi (0%)  170Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
However, in the case of the normal sandbox cluster started with flytectl demo start, below was the node capacity
Copy code
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                   ------------  ----------  ---------------  -------------  ---
  kube-system                 local-path-provisioner-7b7dc8d6f5-lb2rx                0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
  flyte                       flyte-sandbox-kubernetes-dashboard-6757db879c-vfvnw    100m (1%)     2 (25%)     200Mi (0%)       200Mi (0%)     4m57s
  kube-system                 coredns-b96499967-h5np6                                100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     4m57s
  kube-system                 metrics-server-668d979685-5dvtl                        100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         4m57s
  flyte                       flyte-sandbox-docker-registry-6494d7666-mpq9t          0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
  flyte                       flyte-sandbox-proxy-d95874857-v5vrh                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
  flyte                       flyte-sandbox-postgresql-0                             250m (3%)     0 (0%)      256Mi (0%)       0 (0%)         4m57s
  flyte                       flyte-sandbox-minio-645c8ddf7c-h5wbn                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
  flyte                       flyte-sandbox-69c7f848db-g9psq                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         4m57s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                550m (6%)   2 (25%)
  memory             596Mi (0%)  370Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
I can see that there is a lot of difference in the node capacities for the two cases, which is causing the first one to crash because of the disk issue.
Please open the thread in a new window for better visibility, apologies. Is there any way I can start the cluster for the first method with GPU support that adds the necessary capacities, as in the normal case, along with adding --gpus all to docker run?
j
iirc, in the PR, they were able to get it working by modifying the docker config only, as opposed to setting the gpus flag. if so, flytectl demo start should work. did you try that?
g
No. flytectl demo start --image <> is not working. I have the Docker config with the NVIDIA runtime only. We need to add --gpus all to docker run, otherwise the sandbox will not have access to the GPU, and that’s why it’s crashing immediately. Another user in the patch comments also faced the same issue.
For --gpus all to work, we must have the Docker config with the NVIDIA runtime, so that part is correct. The only missing part is calling docker run with --gpus all from the flytectl demo start --image <> command.
j
hmm i see. ok. w.r.t. the capacities, it looks like the difference is just due to a missing Flyte namespace.
is that because the Flyte pods are crashlooping?
g
hmm i see. ok. w.r.t. the capacities, it looks like the difference is just due to a missing Flyte namespace.
Look at the Allocated resources: section; even that is lower in the first case (make start)
j
Yes, because there are fewer pods. You want to look at Allocatable capacity, a bit higher in the output
g
is that because the Flyte pods are crashlooping?
It’s happening at the stage when the pod pulls the ~24 GB image, when the node starts falling short of ephemeral storage. It’s working fine in the normal case (flytectl demo start). Tried multiple times
j
what machine is this running on?
how big is your disk rather?
g
Linux. 1 TB disk, 64 GB RAM
Is there a way we can make the node capacities the same when starting with make start, or pass the --gpus all flag when calling flytectl demo start --image <>? I think either of them would solve the issue.
j
The node capacities should be the same I’d think. Can you paste the full output of “kubectl describe nodes” in both scenarios?
Passing the gpus flag will require a change to flytectl. GPU support hasn’t been planned yet afaik.
I don’t see why this won’t work with a 1TB disk.
g
https://github.com/flyteorg/flyte/pull/3256#issuecomment-1472279870 doesn’t work?
Yes, tried multiple times; it doesn’t work. I’ve patched his changes onto the master repo, then did make build-gpu and make start. Note that I had to remove manifest-gpu from build-gpu for make build-gpu to produce a Docker image.
Here’s the one for normal case
Copy code
kubectl describe nodes 8bb90a7d22d8
Name:               8bb90a7d22d8
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=8bb90a7d22d8
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"e6:0c:35:ea:3b:5b"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 172.17.0.2
                    k3s.io/hostname: 8bb90a7d22d8
                    k3s.io/internal-ip: 172.17.0.2
                    k3s.io/node-args: ["server","--disable","traefik","--disable","servicelb"]
                    k3s.io/node-config-hash: DTM2Y77ISYLRTGPA5HIDT5VGTAXMZ5BQ5HF4OYULEA4KHR2EII4A====
                    k3s.io/node-env: {"K3S_KUBECONFIG_OUTPUT":"/var/lib/flyte/config/kubeconfig"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 08 Aug 2023 19:22:27 +0530
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  8bb90a7d22d8
  AcquireTime:     <unset>
  RenewTime:       Tue, 08 Aug 2023 19:24:45 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 08 Aug 2023 19:22:58 +0530   Tue, 08 Aug 2023 19:22:26 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 08 Aug 2023 19:22:58 +0530   Tue, 08 Aug 2023 19:22:26 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 08 Aug 2023 19:22:58 +0530   Tue, 08 Aug 2023 19:22:26 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 08 Aug 2023 19:22:58 +0530   Tue, 08 Aug 2023 19:22:37 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.17.0.2
  Hostname:    8bb90a7d22d8
Capacity:
  cpu:                8
  ephemeral-storage:  944801904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  919103291491
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  pods:               110
System Info:
  Machine ID:
  System UUID:                324ab640-d7da-11dd-b4b7-b06ebfc7723f
  Boot ID:                    fe7b217c-8f61-4568-93b0-a897170f1db9
  Kernel Version:             5.15.0-78-generic
  OS Image:                   K3s dev
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.6-k3s1
  Kubelet Version:            v1.24.4+k3s1
  Kube-Proxy Version:         v1.24.4+k3s1
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
ProviderID:                   k3s://8bb90a7d22d8
Non-terminated Pods:          (10 in total)
  Namespace                   Name                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                   ------------  ----------  ---------------  -------------  ---
  kube-system                 coredns-b96499967-8lz84                                100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     3m30s
  kube-system                 local-path-provisioner-7b7dc8d6f5-q2qtw                0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-kubernetes-dashboard-6757db879c-txvl5    100m (1%)     2 (25%)     200Mi (0%)       200Mi (0%)     3m30s
  kube-system                 metrics-server-668d979685-7xl5k                        100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         3m30s
  flyte                       flyte-sandbox-docker-registry-7ddfcc58ff-zvlvj         0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-proxy-d95874857-4g2p7                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-postgresql-0                             250m (3%)     0 (0%)      256Mi (0%)       0 (0%)         3m30s
  flyte                       flyte-sandbox-minio-645c8ddf7c-wkgdk                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-buildkit-7d7d55dbb-kh949                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
  flyte                       flyte-sandbox-98749fb56-bsvsw                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m30s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                550m (6%)   2 (25%)
  memory             596Mi (0%)  370Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   Starting                 2m22s                  kube-proxy
  Normal   Starting                 2m24s                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      2m24s                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  2m24s (x2 over 2m24s)  kubelet                Node 8bb90a7d22d8 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    2m24s (x2 over 2m24s)  kubelet                Node 8bb90a7d22d8 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     2m24s (x2 over 2m24s)  kubelet                Node 8bb90a7d22d8 status is now: NodeHasSufficientPID
  Normal   Synced                   2m23s                  cloud-node-controller  Node synced successfully
  Normal   NodeAllocatableEnforced  2m23s                  kubelet                Updated Node Allocatable limit across pods
  Normal   RegisteredNode           2m20s                  node-controller        Node 8bb90a7d22d8 event: Registered Node 8bb90a7d22d8 in Controller
  Normal   NodeReady                2m13s                  kubelet                Node 8bb90a7d22d8 status is now: NodeReady
Here’s the one for make start
Copy code
kubectl describe nodes 9a694765b4ca
Name:               9a694765b4ca
Roles:              control-plane,master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=9a694765b4ca
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=true
                    node-role.kubernetes.io/master=true
                    node.kubernetes.io/instance-type=k3s
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"22:24:c9:c3:a8:f8"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 172.17.0.2
                    k3s.io/hostname: 9a694765b4ca
                    k3s.io/internal-ip: 172.17.0.2
                    k3s.io/node-args: ["server","--disable","traefik","--disable","servicelb"]
                    k3s.io/node-config-hash: 4J5AJPVISIGUNB3TJQ23D56IPSX5PBUDZLHDT7KGRNZTZCZG2KXA====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/cc73cc54f96e096349faa656c251469ebefa4f9ab0d3f356ea6895ff145dcd1e","K3S_KUBECONFIG_OUTPUT":"/.ku...
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 08 Aug 2023 17:59:08 +0530
Taints:             node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  9a694765b4ca
  AcquireTime:     <unset>
  RenewTime:       Tue, 08 Aug 2023 18:35:39 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 08 Aug 2023 18:35:11 +0530   Tue, 08 Aug 2023 18:29:23 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     True    Tue, 08 Aug 2023 18:35:11 +0530   Tue, 08 Aug 2023 18:31:57 +0530   KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure      False   Tue, 08 Aug 2023 18:35:11 +0530   Tue, 08 Aug 2023 18:29:23 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 08 Aug 2023 18:35:11 +0530   Tue, 08 Aug 2023 18:29:23 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.17.0.2
  Hostname:    9a694765b4ca
Capacity:
  cpu:                8
  ephemeral-storage:  944801904Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  919103291491
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65774996Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:
  System UUID:                324ab640-d7da-11dd-b4b7-b06ebfc7723f
  Boot ID:                    fe7b217c-8f61-4568-93b0-a897170f1db9
  Kernel Version:             5.15.0-78-generic
  OS Image:                   Ubuntu 20.04.6 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.12-k3s1
  Kubelet Version:            v1.24.9+k3s1
  Kube-Proxy Version:         v1.24.9+k3s1
PodCIDR:                      10.42.0.0/24
PodCIDRs:                     10.42.0.0/24
ProviderID:                   k3s://9a694765b4ca
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                       ------------  ----------  ---------------  -------------  ---
  kube-system                 nvidia-device-plugin-daemonset-45mbf       0 (0%)        0 (0%)      0 (0%)           0 (0%)         11m
  kube-system                 metrics-server-667586758d-8fflg            100m (1%)     0 (0%)      70Mi (0%)        0 (0%)         37m
  kube-system                 coredns-7b5bbc6644-hsk6n                   100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     37m
  kube-system                 local-path-provisioner-687d6d7765-dsgmm    0 (0%)        0 (0%)      0 (0%)           0 (0%)         37m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                200m (2%)   0 (0%)
  memory             140Mi (0%)  170Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:
  Type     Reason                   Age                  From                   Message
  ----     ------                   ----                 ----                   -------
  Normal   Starting                 36m                  kube-proxy
  Normal   Synced                   36m                  cloud-node-controller  Node synced successfully
  Normal   Starting                 36m                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      36m                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeAllocatableEnforced  36m                  kubelet                Updated Node Allocatable limit across pods
  Normal   RegisteredNode           36m                  node-controller        Node 9a694765b4ca event: Registered Node 9a694765b4ca in Controller
  Warning  FreeDiskSpaceFailed      31m                  kubelet                failed to garbage collect required amount of images. Wanted to free 104807682867 bytes, but freed 155692522 bytes
  Warning  ImageGCFailed            31m                  kubelet                failed to garbage collect required amount of images. Wanted to free 104807682867 bytes, but freed 155692522 bytes
  Warning  ImageGCFailed            26m                  kubelet                failed to garbage collect required amount of images. Wanted to free 104841306931 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      26m                  kubelet                failed to garbage collect required amount of images. Wanted to free 104841306931 bytes, but freed 0 bytes
  Warning  ImageGCFailed            21m                  kubelet                failed to garbage collect required amount of images. Wanted to free 106252362547 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      21m                  kubelet                failed to garbage collect required amount of images. Wanted to free 106252362547 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      16m                  kubelet                failed to garbage collect required amount of images. Wanted to free 110294586163 bytes, but freed 0 bytes
  Warning  ImageGCFailed            16m                  kubelet                failed to garbage collect required amount of images. Wanted to free 110294586163 bytes, but freed 0 bytes
  Warning  ImageGCFailed            11m                  kubelet                failed to garbage collect required amount of images. Wanted to free 112607851315 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      11m                  kubelet                failed to garbage collect required amount of images. Wanted to free 112607851315 bytes, but freed 0 bytes
  Warning  FreeDiskSpaceFailed      6m33s                kubelet                failed to garbage collect required amount of images. Wanted to free 135589974835 bytes, but freed 0 bytes
  Warning  ImageGCFailed            6m33s                kubelet                failed to garbage collect required amount of images. Wanted to free 135589974835 bytes, but freed 0 bytes
  Normal   NodeNotReady             6m25s (x2 over 17m)  node-controller        Node 9a694765b4ca status is now: NodeNotReady
  Normal   NodeHasSufficientMemory  6m19s (x4 over 36m)  kubelet                Node 9a694765b4ca status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    6m19s (x4 over 36m)  kubelet                Node 9a694765b4ca status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     6m19s (x4 over 36m)  kubelet                Node 9a694765b4ca status is now: NodeHasSufficientPID
  Normal   NodeReady                6m19s (x3 over 36m)  kubelet                Node 9a694765b4ca status is now: NodeReady
  Warning  EvictionThresholdMet     3m52s                kubelet                Attempting to reclaim ephemeral-storage
  Warning  FreeDiskSpaceFailed      92s                  kubelet                failed to garbage collect required amount of images. Wanted to free 142914827059 bytes, but freed 284035241 bytes
j
hmm I see, ok.
g
Could removing manifest-gpu from build-gpu in the Makefile be the culprit? 🤔
j
Allocatable ephemeral-storage looks to be the same in both cases.
Try this: “docker system prune -a —volumes”
that should be 2 dashes. i’m on my phone :(
also try deleting any unused images in “docker images”
then prune again
g
I only did docker system prune some time back, to see if it helps; it didn’t. Do you think docker system prune -a —volumes would be worth a shot?
j
yes
2 dashes before volumes
g
2 dashes before volumes
Ok. Sure, let me do that.
j
you can also check your disk usage with df -h
g
Copy code
❯ df -H
Filesystem                         Size  Used Avail Use% Mounted on
udev                                34G     0   34G   0% /dev
tmpfs                              6.8G  2.3M  6.8G   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv  968G  838G   81G  92% /
tmpfs                               34G   52M   34G   1% /dev/shm
tmpfs                              5.3M  4.1k  5.3M   1% /run/lock
tmpfs                               34G     0   34G   0% /sys/fs/cgroup
/dev/loop2                          59M   59M     0 100% /snap/core18/2785
/dev/loop5                          97M   97M     0 100% /snap/lxd/24061
/dev/loop3                          97M   97M     0 100% /snap/lxd/23991
/dev/loop1                          67M   67M     0 100% /snap/core20/1950
/dev/loop6                          56M   56M     0 100% /snap/snapd/19457
/dev/loop7                          56M   56M     0 100% /snap/snapd/19361
/dev/loop8                         337M  337M     0 100% /snap/vlc/3078
/dev/loop4                          67M   67M     0 100% /snap/core20/1974
/dev/loop0                          59M   59M     0 100% /snap/core18/2751
/dev/sda2                           11G  434M  9.5G   5% /boot
/dev/sda1                          5.4G  5.5M  5.4G   1% /boot/efi
tmpfs                              6.8G   21k  6.8G   1% /run/user/1776609218
Looks pretty full. 92%
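As a side note on reading these numbers: df computes Use% against used + available (root-reserved blocks are excluded), which is why the reported 92% is higher than used/total. A small stdlib sketch, with pct_used as a hypothetical helper name:

```python
import shutil

def pct_used(used_gb: float, avail_gb: float) -> float:
    # df-style percentage: used / (used + available), ignoring reserved blocks.
    return round(100 * used_gb / (used_gb + avail_gb), 1)

# The df output above shows 838G used and 81G available on /:
print(pct_used(838, 81))  # -> 91.2, close to the 92% df reports (df rounds up)

# The same figure can be read live with the standard library:
usage = shutil.disk_usage("/")
print(f"{usage.used / 1e9:.0f} GB used of {usage.total / 1e9:.0f} GB total")
```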
j
That might explain it. You are effectively working with <100G. did the prune free up space?
g
I was just taking a break. Just triggered it
Anyway, for the first issue with the ContainerTask, there is a regression, as it was working with 1.8.1
j
weird things happen under disk pressure unfortunately.
Also the branch is significantly behind 1.8.1
g
well, claimed 100G from
docker prune
and 100G from deleting
cache
. It deleted the sandbox-gpu image as well. Let me build it again and try
FYI for the case, if I do
flytectl demo start --image <>
, the container was exiting immediately.
That worked!!!! Damn disk usage 😓 Thank you so much @jeev 🙌
Now, the node spec is the same for both the normal start and
make start
.
j
👍
b
Great! 🙂
g
Update:
flytectl demo start --image <>
is working fine with GPU compatible sandbox image with these changes https://github.com/flyteorg/flyte/pull/3256#issuecomment-1670780686
I’m still facing the 1st issue, with the ContainerTask execution. I wanted this task to be cached, so I defined the signature as below, as recommended by @Ketan (kumare3) earlier
Copy code
my_task = ContainerTask(
    metadata=TaskMetadata(cache=True, cache_version="1.2"),
    name="my_task",
    image="my-task-image",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inDir=str),
    outputs=kwtypes(out=str),
    command=[
        "/bin/bash",
    ],
    arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out; ... other commands",
    ],
)
As done above, I’m creating
out
file in
/var/outputs
. This was working in Flyte version 1.8.1; however, with the master version, I’m getting the below error during task execution in the UI.
Copy code
[1/1] currentAttempt done. Last Error: UNKNOWN::Outputs not generated by task execution
Note that if I comment out the line
outputs=kwtypes(out=str)
, execution passes, but the task is not cached. I’m assuming some change between 1.8.1 and the latest version altered this behaviour.
Hi @Ketan (kumare3) @jeev Could you please let me know why the above config for ContainerTask is not working in the latest Flyte version? I need this task cached, as it’s quite time-consuming to run every time.
k
Outputs are not cached because they were not generated
You don’t have it in the signature, so it will be disabled
If this is a regression we will fix it - can you downgrade and confirm it works?
g
I’m generating them here
Copy code
arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out; ... other commands",
    ],
If this is a regression we will fix it - can you downgrade and confirm it works?
this works in 1.8.1
k
Can you use 1.8.1?
Cc @Eduardo Apolinario (eapolinario) regression? Weird one
g
If I check the contents of
/var/outputs/out
it is
out
. It was supposed to be
"out"
k
And you are saying caching in 1.9.0 for container tasks does not work, right? AFK, will try later
g
If I manually change it to
"out"
during execution, it is getting cached. Not sure what leads to this behavior
If I check the contents of
/var/outputs/out
it is
out
. It was supposed to be
"out"
This is the problem. It’s not putting
"out"
which it should I guess, it’s putting
out
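For what it’s worth, if the arguments really are run through `bash -c`, this is standard shell quote removal: Python’s `\"` is just a literal double quote, and bash then strips unescaped quotes during word splitting, so the file would contain `out` with no quotes. A minimal sketch of that behavior (a temp file stands in for `/var/outputs/out`):

```python
import os
import subprocess
import tempfile

def run_echo_task() -> str:
    """Run the same `bash -c 'echo "out" > <file>'` command the task uses."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out")
        # In Python source, "echo \"out\"" is the string: echo "out"
        subprocess.run(["/bin/bash", "-c", f'echo "out" > {path}'], check=True)
        with open(path) as f:
            return f.read()

print(repr(run_echo_task()))  # 'out\n' -- bash removed the quotes
```

So a version in which the quotes survive into the file would suggest the command string is being processed differently there.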
k
Wdym manually change to out
g
During execution, I went to the pod (
kubectl exec ..
) and then change the file content to
"out"
, it worked in that case
k
Like you added quotes? Hmm, this implies we are not casting the string correctly. But if you do not add the quotes, what outputs do you see in the UI?
Ok, give us a day; it’s 6:00 am our time
g
image.png
this implies we are not casting string correctly
Yes, I believe
@Ketan (kumare3) I downgraded to 1.8.1, it worked with the below code as expected.
Copy code
arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out; ... other commands",
    ],
So, there is definitely a regression on caching for ContainerTask on master branch.
e
@Gaurav Kumar, I can't repro this on a sandbox built off of master (specifically this sandbox image: ghcr.io/flyteorg/flyte-sandbox-bundled:sha-dfb56f4639a57d519d8fc48cae7d192a385fc160). I ran the following example:
Copy code
from flytekit import ContainerTask, TaskMetadata, kwtypes

my_task = ContainerTask(
    metadata=TaskMetadata(cache=True, cache_version="v1"),
    name="my_task",
    image="ubuntu:latest",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(),
    outputs=kwtypes(out=str),
    command=[
        "/bin/bash",
    ],
    arguments=[
        "-c",
        "echo \"out\" > /var/outputs/out",
    ],
)
Also, minor I know, but I see the escaped double-quotes in the output using the
ubuntu:latest
image.
How are you building this sandbox image in your case?
g
I’ve manually built the image from the master repo using
make build
in
docker/sandbox-bundled
.
Have a look at this screenshot for your reference for the non-working case, with the image built from the master repo. You can see the inputs on the right. The UI escapes the
\
, that’s why it’s showing
echo "out"
, but it was
echo \"out\"
in the code
Here’s the screenshot of the working scenario, after I downgraded to 1.8.1; the inputs are exactly the same
e
very interesting. I can't repro this yet (also built an image off of master locally). Mind expanding the command you're running? Is there any chance that the "out" file is being removed?
g
No, out is not removed anywhere in the cmd. The same cmd is being passed in both versions. One point to mention is that one of the commands takes a raw string as input, something like this
"echo \"out\" > /var/outputs/out; ... ; my-bin --arguments '{\"arg_1\": \"\\\"1\\\"\", \"arg_2\": \"\\\"2\\\"\"}' "
. Not sure if that plays a role.
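On the quoting layers in that fragment (the elided `...` parts omitted): after Python resolves `\"` and `\\\"`, bash sees a single-quoted JSON blob and passes it through verbatim, since bash leaves everything inside single quotes untouched. A sketch that checks what the inner binary would actually receive (the values `1` and `2` are from the snippet; `printf` stands in for the program taking the argument):

```python
import json
import subprocess

# The JSON fragment from the chat. After Python escape processing, bash
# receives: '{"arg_1": "\"1\"", "arg_2": "\"2\""}'
fragment = "'{\"arg_1\": \"\\\"1\\\"\", \"arg_2\": \"\\\"2\\\"\"}'"
# Bash strips only the outer single quotes; the backslashes and inner
# double quotes reach the program untouched as a single argument.
received = subprocess.run(
    ["/bin/bash", "-c", "printf '%s' " + fragment],
    capture_output=True, text=True, check=True,
).stdout
parsed = json.loads(received)
print(parsed)  # {'arg_1': '"1"', 'arg_2': '"2"'}
```

So the JSON values themselves end up containing literal double quotes, which is presumably the intent, but it does mean three layers of escaping (Python, bash, JSON) have to line up.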
e
@Gaurav Kumar, I'm having a hard time reproing this error. Can you push your local changes to https://github.com/flyteorg/flyte/pull/3256 ? Also, just to rule out any weirdness with the echo binary, can you talk a bit about the image you're using as the base image?
g
@Eduardo Apolinario (eapolinario) I’m using the image built on top of https://github.com/flyteorg/flyte/pull/3256 only. Below are the only extra changes I had to make to get it working for my task execution, which needs a GPU: 1. Change the NVIDIA/cuda base image for compatibility with my drivers. 2. Add env variables related to the NVIDIA drivers for task execution. 3. Remove
manifests-gpu
from
build-gpu
target, as it was failing with a helm parsing issue for me if I triggered
make build-gpu
with that. Removing
manifest-gpu
worked for me.
Copy code
diff --git a/docker/sandbox-bundled/Dockerfile.gpu b/docker/sandbox-bundled/Dockerfile.gpu
index 93d8afbb6..6b5ec90b6 100644
--- a/docker/sandbox-bundled/Dockerfile.gpu
+++ b/docker/sandbox-bundled/Dockerfile.gpu
@@ -26,7 +26,7 @@ RUN --mount=type=cache,target=/root/.cache/go-build --mount=type=cache,target=/r
 # syntax=docker/dockerfile:1.4-labs

 #Following
-FROM nvidia/cuda:11.4.3-base-ubuntu20.04
+FROM nvidia/cuda:11.4.0-base-ubuntu20.04

 RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

@@ -76,9 +76,6 @@ VOLUME /var/lib/rancher/k3s
 VOLUME /var/lib/cni
 VOLUME /var/log

-ENV NVIDIA_VISIBLE_DEVICES="all"
-ENV NVIDIA_DRIVER_CAPABILITIES="all"
-RUN nvidia-ctk runtime configure --runtime=docker --set-as-default

 ENTRYPOINT [ "/bin/k3d-entrypoint.sh" ]
-CMD [ "server", "--disable=traefik", "--disable=servicelb", "--kubelet-arg=allowed-unsafe-sysctls=fs.mqueue.*" ]
+CMD [ "server", "--disable=traefik", "--disable=servicelb" ]
diff --git a/docker/sandbox-bundled/Makefile b/docker/sandbox-bundled/Makefile
index 9eb9970f0..e0e32e530 100644
--- a/docker/sandbox-bundled/Makefile
+++ b/docker/sandbox-bundled/Makefile
@@ -44,7 +44,7 @@ build: flyte manifests
                --tag flyte-sandbox:latest .

 .PHONY: build-gpu
-build-gpu: flyte
+build-gpu: flyte manifests-gpu
        [ -n "$(shell docker buildx ls | awk '/^flyte-sandbox / {print $$1}')" ] || \
                 docker buildx create --name flyte-sandbox \
                 --driver docker-container --driver-opt image=moby/buildkit:master \
So, I applied https://github.com/flyteorg/flyte/pull/3256 on top of 1.8.1 and on master. It’s working on 1.8.1 but not on master. It has nothing to do with the binary image. I’m seeing this with different tasks as well.
This is pure magic; even I’m a bit confused. 🤔
j
@Gaurav kumar what version of copilot is it running? It should be a sidecar on the pod.