Gaurav Kumar

about 2 years ago
Hi, I have a ContainerTask as shown below:
```python
my_task = ContainerTask(
    metadata=TaskMetadata(cache=True, cache_version="1.0"),
    name="my-task",
    image="my-image",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inDir=str),
    outputs=kwtypes(out=str),
    requests=Resources(gpu="1"),
    limits=Resources(gpu="1"),
    command=[
        "/bin/bash",
    ],
    arguments=[
        "-c",
        'echo "out" > /var/outputs/out; ... other commands',
    ],
    ....
)
```
I wanted to cache the task, and I found that I had to declare inputs/outputs for that even though I don't need them. So I just write the string "out" to `/var/outputs/out` as shown in the `arguments`, and pass a string for `inDir` while calling the task, as below:
```python
@workflow
def aeb_sanity_workflow(data: Dict):
    ## -----------------------------------------------------------------------------
    .......
    my_task_promise = my_task(inDir="some string")
    ........
```
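As a local sanity check of the `command`/`arguments` pattern above, here is a minimal sketch of how `/bin/bash -c` produces the declared output file (a temporary directory stands in for `/var/outputs`; this just mimics the shell step, not Flyte itself):

```python
import subprocess
import tempfile
from pathlib import Path

# Temporary directory standing in for the task's output_data_dir (/var/outputs).
outdir = Path(tempfile.mkdtemp())

# Same shape as the ContainerTask invocation: command=["/bin/bash"], arguments=["-c", "..."].
script = f'echo "out" > {outdir}/out'
subprocess.run(["/bin/bash", "-c", script], check=True)

# Flyte would look for the declared output "out" as a file in the output data dir.
print((outdir / "out").read_text().strip())  # out
```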
This was working for me with an earlier version of Flyte:
`cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a`
However, I now use the master version of Flyte patched with https://github.com/flyteorg/flyte/pull/3256 and manually built in `docker/sandbox-bundled` using `make build-gpu`, because I needed GPU support in the sandbox. With this latest version I'm seeing two issues that were not there with `cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a` (tag: v1.8.1):
1. For the above-mentioned ContainerTask, it's throwing errors saying the output doesn't exist after workflow execution. I haven't changed a single line of code in `my_task` other than switching to the latest Flyte image.
2. For the task that needs a GPU, since the image size is huge (~24 GB), the k8s node came under disk pressure and several pods were evicted.
```
> kubectl describe pod <gpu-pod>
  Warning  Evicted              8m10s (x3 over 9m30s)  kubelet            The node was low on resource: ephemeral-storage.
  Warning  ExceededGracePeriod  8m (x3 over 9m20s)     kubelet            Container runtime did not kill the pod within specified grace period.
  Normal   Pulled               7m59s                  kubelet            Successfully pulled image "my-gpu-image" in 8m40.26240502s
  Normal   Created              7m59s                  kubelet            Created container primary
  Normal   Started              7m58s                  kubelet            Started container primary
  Normal   Killing              7m58s                  kubelet            Stopping container primary
  Warning  Evicted              7m30s                  kubelet            The node was low on resource: ephemeral-storage. Container primary was using 13516Ki, which exceeds its request of 0.
```
```
> kubectl describe nodes <>
  Warning  FreeDiskSpaceFailed      52m                    kubelet                failed to garbage collect required amount of images. Wanted to free 110758122291 bytes, but freed 155692522 bytes
  Warning  ImageGCFailed            52m                    kubelet                failed to garbage collect required amount of images. Wanted to free 110758122291 bytes, but freed 155692522 bytes
  Warning  FreeDiskSpaceFailed      47m                    kubelet                failed to garbage collect required amount of images. Wanted to free 111138763571 bytes, but freed 0 bytes
  Warning  ImageGCFailed            47m                    kubelet                failed to garbage collect required amount of images. Wanted to free 111138763571 bytes, but freed 0 bytes
  Warning  EvictionThresholdMet     7m56s (x3 over 11m)    kubelet                Attempting to reclaim ephemeral-storage
  Normal   NodeNotReady             7m49s                  node-controller        Node 1fefe346c083 status is now: NodeNotReady
  Normal   NodeHasSufficientMemory  7m47s (x3 over 57m)    kubelet                Node 1fefe346c083 status is now: NodeHasSufficientMemory
  Normal   NodeHasDiskPressure      7m47s (x2 over 11m)    kubelet                Node 1fefe346c083 status is now: NodeHasDiskPressure
```
I didn't observe these issues with the image `cr.flyte.org/flyteorg/flyte-sandbox-bundled:sha-1a8d37570cda76cc01bf8c26354f4aad4debcd0a` (tag: v1.8.1).
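One mitigation I'm considering for the eviction (a sketch, not yet verified in my setup): flytekit's `Resources` accepts an `ephemeral_storage` field, so the task could declare its disk needs up front and the scheduler would avoid nodes without enough free ephemeral storage. The `"30Gi"` value below is a guess sized to the ~24 GB image plus scratch space:

```python
from flytekit import ContainerTask, Resources, TaskMetadata, kwtypes

# Hypothetical variant of my_task that also requests ephemeral storage.
my_task = ContainerTask(
    metadata=TaskMetadata(cache=True, cache_version="1.0"),
    name="my-task",
    image="my-image",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inDir=str),
    outputs=kwtypes(out=str),
    requests=Resources(gpu="1", ephemeral_storage="30Gi"),  # assumed value
    limits=Resources(gpu="1", ephemeral_storage="30Gi"),
    command=["/bin/bash"],
    arguments=["-c", 'echo "out" > /var/outputs/out'],
)
```

This wouldn't fix the image GC failures themselves, but it should stop pods landing on nodes that are already near the eviction threshold.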