#ask-the-community

Jake Dodd

12/11/2023, 5:14 PM
Hi all, apologies if this is a basic question, but I’m trying to run the MLflow example workflow and the task keeps failing with `OOMKilled`. I’ve tried increasing the `flytepropeller.resources.limits.memory` and `flytepropeller.resources.requests.memory` values, but this didn’t seem to have any effect.

David Espejo (he/him)

12/11/2023, 9:21 PM

Jake Dodd

12/12/2023, 10:36 AM
Hi David, I’ve tried increasing the default up to `10Gi` and it still gets `OOMKilled` at the same point every time. Here is the full error log from the UI:
[1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
[fc5d90298b68749979ad-n0-0] terminated with exit code (137). Reason [OOMKilled]. Message: 
1880 [================>.............] - ETA: 0s
17801216/26421880 [===================>..........] - ETA: 0s
20733952/26421880 [======================>.......] - ETA: 0s
22904832/26421880 [=========================>....] - ETA: 0s
25673728/26421880 [============================>.] - ETA: 0s
26421880/26421880 [==============================] - 1s 0us/step
Downloading data from <https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz>

5148/5148 [==============================] - 0s 0us/step
Downloading data from <https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz>

   8192/4422102 [..............................] - ETA: 0s
  49152/4422102 [..............................] - ETA: 7s
  81920/4422102 [..............................] - ETA: 9s
 327680/4422102 [=>............................] - ETA: 2s
 573440/4422102 [==>...........................] - ETA: 1s
 851968/4422102 [====>.........................] - ETA: 1s
1474560/4422102 [=========>....................] - ETA: 0s
2121728/4422102 [=============>................] - ETA: 0s
3375104/4422102 [=====================>........] - ETA: 0s
4422102/4422102 [==============================] - 1s 0us/step

David Espejo (he/him)

12/12/2023, 1:03 PM
how's the memory consumption in the node? I'm trying to reproduce the issue

Jake Dodd

12/12/2023, 1:34 PM
from running `kubectl top node`, there is currently a node consuming 24% memory

Chris Grass

12/12/2023, 4:03 PM
have you confirmed the task pod actually has 10Gi of memory? i had a situation where i had config in two places and the pod's memory setting was being overwritten unexpectedly

David Espejo (he/him)

12/12/2023, 4:44 PM
@Jake Dodd could you also describe the pod to confirm the actual resources being allocated to the Pod? Like `kubectl get pods -n flytesnacks-development`, then `kubectl get pod <execution-id-Pod> -n flytesnacks-development -o yaml`, and find the `spec.containers[].resources` block
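If it helps, the same check can be scripted. This is only a rough sketch: it assumes the pod was dumped with `-o json` instead of `-o yaml`, and the `container_resources` helper and the sample pod below are made up for illustration (the container name is borrowed from the error log above). The point is that requests and limits live on each container under `spec.containers[]`, not directly under `spec`.

```python
import json


def container_resources(pod: dict) -> dict:
    """Map each container name to its `resources` block (empty dict if unset)."""
    containers = pod.get("spec", {}).get("containers", [])
    return {c["name"]: c.get("resources", {}) for c in containers}


# Made-up sample shaped like `kubectl get pod <pod> -o json` output.
SAMPLE_POD_JSON = """
{
  "spec": {
    "containers": [
      {
        "name": "fc5d90298b68749979ad-n0-0",
        "resources": {
          "limits": {"cpu": "1", "memory": "1Gi"},
          "requests": {"cpu": "1", "memory": "1Gi"}
        }
      }
    ]
  }
}
"""

# Print the requests and limits of every container in the sample pod.
for name, resources in container_resources(json.loads(SAMPLE_POD_JSON)).items():
    print(name, resources.get("requests"), resources.get("limits"))
```

To run it against a real pod, pipe `kubectl get pod <execution-id-Pod> -n flytesnacks-development -o json` into a script that calls `json.load(sys.stdin)` instead of parsing the sample string.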

Jake Dodd

12/12/2023, 4:47 PM
this is the `spec.containers[].resources` block:
resources:
  limits:
    cpu: "1"
    memory: 1Gi
  requests:
    cpu: "1"
    memory: 1Gi

David Espejo (he/him)

12/12/2023, 8:51 PM
so, if you set your default memory request to 10Gi and no limit is set, Flyte makes request=limit, which is more deterministic for the K8s scheduler. Now, if your actual Pod only requests 1Gi, the default resources are somehow not being applied. For now we can probably try overriding them by adding this to the task decorator:
from flytekit import Resources, task

# Per-task override of the platform's default resource requests
@task(requests=Resources(cpu="1", mem="4Gi"))
def train_model() -> None:  # placeholder task name
    ...
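To make the request=limit behaviour concrete, here is a toy model in plain Python. It is not Flyte's actual code, and `MemorySpec` and `resolve_memory` are hypothetical names. It only mirrors the rule described above: a missing request falls back to the platform default, and a missing limit is filled in with the request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemorySpec:
    request: str
    limit: str


def resolve_memory(
    task_request: Optional[str],
    task_limit: Optional[str],
    default_request: str,
) -> MemorySpec:
    """Toy model: fill in a missing request from the default, then a missing limit from the request."""
    request = task_request or default_request
    limit = task_limit or request
    return MemorySpec(request=request, limit=limit)


# No task-level settings: the platform default is used for both values.
print(resolve_memory(None, None, "10Gi"))
# Task-level request only: the limit follows the request.
print(resolve_memory("4Gi", None, "10Gi"))
```

Under this model a Pod showing 1Gi for both request and limit would mean the value in effect at launch was 1Gi, which is why the Pod spec is worth checking against the configured default.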

Jake Dodd

12/13/2023, 9:32 AM
Hi David, I tried overriding at the task level and got the following error:
RPC Failed, with Status: StatusCode.INVALID_ARGUMENT
        details: Requested MEMORY default [4Gi] is greater than current limit set in the platform configuration [1Gi]. Please contact Flyte Admins to change these limits or consult the configuration
I can share my `values.yaml` with you if that will help troubleshoot this?
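For reference, the comparison behind that error can be sketched in a few lines. This is an illustration, not the real FlyteAdmin validation: `parse_quantity` and `exceeds_platform_limit` are hypothetical helpers, and only a subset of Kubernetes quantity suffixes is handled. The `[1Gi]` in the message is most likely the platform-wide task resource limit (in the flyte-core Helm chart this usually sits under `configmap.task_resource_defaults.task_resources.limits`, assuming that chart is in use). That setting is separate from `flytepropeller.resources`, which sizes the FlytePropeller pod itself rather than the task pods.

```python
import re

# Multipliers for a subset of Kubernetes quantity suffixes (binary and decimal).
_SUFFIXES = {
    "": 1,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

_QUANTITY_RE = re.compile(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?")


def parse_quantity(quantity: str) -> int:
    """Convert a memory quantity such as '4Gi' or '512Mi' to bytes."""
    match = _QUANTITY_RE.fullmatch(quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(float(number) * _SUFFIXES[suffix or ""])


def exceeds_platform_limit(requested: str, platform_limit: str) -> bool:
    """True when the requested memory is larger than the platform limit."""
    return parse_quantity(requested) > parse_quantity(platform_limit)


# The values from the error message: a 4Gi request against a 1Gi limit.
print(exceeds_platform_limit("4Gi", "1Gi"))
```

With the values from the error, the 4Gi request is above the 1Gi limit, so the task-level override is rejected until the platform limit itself is raised.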