Hi all apologies if this is a basic question but I m trying Flyte #flyte-support

strong-plumber-41198

12/11/2023, 5:14 PM

Hi all, apologies if this is a basic question, but I’m trying to run the MLflow example workflow and the task keeps failing with `OOMKilled`. I’ve tried increasing the `flytepropeller.resources.limits.memory` and `flytepropeller.resources.resources.memory` values, but this didn’t seem to have had any effect

average-finland-92144

12/11/2023, 9:21 PM

hey Jake, have you tried increasing the default task resource requests? https://github.com/unionai-oss/deploy-flyte/blob/db3132ac910ddb8c68a643990ddf10eafb6163d3/environments/gcp/flyte-core/values-gcp-core.yaml#L238-L245

strong-plumber-41198

12/12/2023, 10:36 AM

Hi David, I’ve tried increasing the default up to `10Gi` and it still gets `OOMKilled` at the same point every time. Here is the full error log from the UI:

[1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
[fc5d90298b68749979ad-n0-0] terminated with exit code (137). Reason [OOMKilled]. Message: 
1880 [================>.............] - ETA: 0s
17801216/26421880 [===================>..........] - ETA: 0s
20733952/26421880 [======================>.......] - ETA: 0s
22904832/26421880 [=========================>....] - ETA: 0s
25673728/26421880 [============================>.] - ETA: 0s
26421880/26421880 [==============================] - 1s 0us/step
Downloading data from <https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz>

5148/5148 [==============================] - 0s 0us/step
Downloading data from <https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz>

   8192/4422102 [..............................] - ETA: 0s
  49152/4422102 [..............................] - ETA: 7s
  81920/4422102 [..............................] - ETA: 9s
 327680/4422102 [=>............................] - ETA: 2s
 573440/4422102 [==>...........................] - ETA: 1s
 851968/4422102 [====>.........................] - ETA: 1s
1474560/4422102 [=========>....................] - ETA: 0s
2121728/4422102 [=============>................] - ETA: 0s
3375104/4422102 [=====================>........] - ETA: 0s
4422102/4422102 [==============================] - 1s 0us/step
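
Editor's note: exit code 137 in the log above follows the common "128 + signal number" convention, so it corresponds to signal 9 (SIGKILL), which is consistent with the container being killed for exceeding its memory limit. A minimal sketch of that arithmetic follows; the helper name `describe_exit_code` is hypothetical and not part of Flyte or Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


# 137 - 128 = 9, which is SIGKILL on Linux
print(describe_exit_code(137))
```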

average-finland-92144

12/12/2023, 1:03 PM

how's the memory consumption in the node? I'm trying to reproduce the issue

strong-plumber-41198

12/12/2023, 1:34 PM

From running `kubectl top node`, there is currently a node consuming 24% memory

proud-answer-87162

12/12/2023, 4:03 PM

have you confirmed the task pod has 10gb of mem? i had a situation where i had config in two places and the mem of the pod was being overwritten unexpectedly

average-finland-92144

12/12/2023, 4:44 PM

@strong-plumber-41198 could you also describe the pod to confirm the actual resources being allocated to it? Something like `kubectl get pods -n flytesnacks-development`, then `kubectl get pod <execution-id-Pod> -n flytesnacks-development -o yaml`, and find the `resources` block under `spec.containers`

strong-plumber-41198

12/12/2023, 4:47 PM

this is the `resources` block:

resources:
  limits:
    cpu: "1"
    memory: 1Gi
  requests:
    cpu: "1"
    memory: 1Gi
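
Editor's note: the same check can be scripted rather than read by eye. The sketch below assumes `kubectl` is installed and configured for the cluster; the helper names and the container name `example-container` are hypothetical, and the sample manifest only mirrors the values reported in this thread.

```python
import json
import subprocess


def container_resources(pod: dict) -> dict:
    """Map each container name to its resources block from a Pod manifest."""
    containers = pod.get("spec", {}).get("containers", [])
    return {c["name"]: c.get("resources", {}) for c in containers}


def fetch_pod(name: str, namespace: str) -> dict:
    """Fetch a Pod manifest as a dict (needs kubectl and cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Sample manifest with a hypothetical container name, mirroring the thread
sample_pod = {
    "spec": {
        "containers": [
            {
                "name": "example-container",
                "resources": {
                    "limits": {"cpu": "1", "memory": "1Gi"},
                    "requests": {"cpu": "1", "memory": "1Gi"},
                },
            }
        ]
    }
}

print(container_resources(sample_pod))
```

Against a live cluster, `container_resources(fetch_pod("<execution-id-Pod>", "flytesnacks-development"))` would return the same mapping for the real Pod.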

average-finland-92144

12/12/2023, 8:51 PM

So, if you set your default memory request to 10Gi and no limit is set, Flyte sets the limit equal to the request, which is more deterministic for the K8s scheduler. Now, if your actual Pod only requests 1Gi, the default resources are somehow not being applied. For now we can probably try overriding them by adding this to the task decorator:

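from flytekit import Resources, task

# Request 1 CPU and 4Gi of memory for this task only
@task(requests=Resources(cpu="1", mem="4Gi"))
def my_task() -> None:  # placeholder name; apply the decorator to the existing task
    ...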

strong-plumber-41198

12/13/2023, 9:32 AM

Hi David, I tried overriding at the task level and got the following error:

RPC Failed, with Status: StatusCode.INVALID_ARGUMENT
        details: Requested MEMORY default [4Gi] is greater than current limit set in the platform configuration [1Gi]. Please contact Flyte Admins to change these limits or consult the configuration
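
Editor's note: the message indicates that the task-level request (4Gi) was compared against a platform-wide limit (1Gi) and rejected. The sketch below only illustrates that kind of comparison on Kubernetes-style memory quantities; it is not Flyte's actual validation code, the helper names are hypothetical, and it handles only a simplified subset of suffixes.

```python
# Simplified subset of Kubernetes quantity suffixes (binary and decimal)
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '4Gi' or '512Mi' to a number of bytes."""
    # Try two-character suffixes first so 'Gi' is not mistaken for 'G'
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # no suffix: plain bytes


def exceeds_limit(requested: str, limit: str) -> bool:
    """Return True when the requested memory is above the configured limit."""
    return memory_to_bytes(requested) > memory_to_bytes(limit)


print(exceeds_limit("4Gi", "1Gi"))
```

With the values from the error, the request exceeds the limit, so raising the platform limit (rather than only the defaults) would be the thing to check next.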

strong-plumber-41198

12/13/2023, 9:33 AM

I can share my `values.yaml` with you if that will help troubleshoot this?
