# ask-the-community
g
Hi, any tips when my task gets OOMKilled? I'm running with `flytectl demo start`. I've already tried: increasing task resources to `mem=6Gi` (the same code ran fine before with 2Gi, so I wouldn't expect that to be the issue); increasing the VM size (running Docker on macOS); removing and re-initializing the demo environment. I didn't change much since I last ran it without issues, other than turning a workflow into a dynamic workflow and perhaps adding one additional task.
j
can you check what the actual allocation of the task pod is?
t
Might need to check the cluster-wide limits set on the Flyte deployment. Flyte will accept task resource limits/requests over the cluster-wide max, but it won't throw an error message if the cluster-wide limit is 2Gi and the request is 6Gi. It'll just cap provisioning at 2Gi.
@Geert
g
It's a bit tricky to capture since the Pod dies almost instantly, so the metrics server can't fetch it with e.g. `kubectl top pod -n tmp-development`. But node memory usage beforehand is around 20% and I don't see any sudden spike before the task fails. @Tommy Nam I don't see any LimitRange or ResourceQuota applied in the demo environment (only some general resource requests/limits around 100-200Mi, but those are for the Flyte deployment itself, not the task Pods). Where could I find these cluster-wide limits?
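One way to still get at the numbers after the fact: the kubelet records why a container was terminated, so even a short-lived pod can be inspected after it dies (a sketch; `<pod>` is a placeholder for the failed task pod's name):

```shell
# List recent pods in the namespace, including failed ones
kubectl get pods -n tmp-development

# Why the container was terminated ("OOMKilled" confirms the memory kill)
kubectl get pod <pod> -n tmp-development \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# The resources the pod was actually scheduled with
kubectl get pod <pod> -n tmp-development \
  -o jsonpath='{.spec.containers[0].resources}'
```

Comparing that last output against what you requested in the task decorator shows whether a cluster-wide cap silently clamped the allocation.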
Checked out the `FlyteWorkflow` CRD:
```
Kind:  FlyteWorkflow
Execution Config:
  Environment Variables:  <nil>
  Interruptible:          <nil>
  Max Parallelism:        25
  Overwrite Cache:        false
  Recovery Execution:
  Task Plugin Impls:
  Task Resources:
    Limits:
      CPU:                2
      Ephemeral Storage:  0
      GPU:                1
      Memory:             1Gi
      Storage:            0
    Requests:
      CPU:                2
      Ephemeral Storage:  0
      GPU:                0
      Memory:             200Mi
      Storage:            0
Execution Id:
  Domain:   development
  Name:     fc4eefd4fe2614c0987f
  Project:  tmp
```
I'm not sure where the `1Gi` here is set (I think from the default here: https://github.com/flyteorg/flyte/blob/1e3d515550cb338c2edb3919d79c6fa1f0da5a19/charts/flyte-core/values.yaml#L520C4-L531C15). Perhaps I'm also misconfiguring the dynamic task's resources? I have the resources set as follows. Should I configure the `@dynamic` workflow to have 6Gi as well?
```python
@task(limits=Resources(mem="6Gi"))
def run_task():
    # do stuff
    ...


@dynamic(limits=Resources(mem="500Mi"))
def base_workflow(config: Config):
    for i in some_list:  # placeholder iterable
        run_task()


@workflow
def wf(config: Config):
    base_workflow(config=config)
```
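In case it helps narrow things down: flytekit's `Resources` also takes `requests`, and setting both on the task makes the pod's scheduled allocation explicit instead of the request falling back to the platform `task_resources` defaults (a sketch, not your actual task):

```python
from flytekit import Resources, task

# Setting requests *and* limits pins the pod allocation explicitly,
# instead of the request falling back to the platform defaults
# (e.g. the 200Mi request / 1Gi limit visible in the CRD dump).
@task(
    requests=Resources(mem="6Gi", cpu="1"),
    limits=Resources(mem="6Gi", cpu="2"),
)
def run_task() -> None:
    ...  # do stuff
```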
Tried this but no luck:
```
flytectl update cluster-resource-attribute --attrFile cra.yaml
```
with `cra.yaml`:
```yaml
attributes:
    projectQuotaCpu: "1000"
    projectQuotaMemory: 8Gi
domain: development
project: tmp
```
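For checking whether an update like that actually landed, flytectl can read the attributes back (a sketch, using the project/domain from this thread):

```shell
# Verify the applied cluster resource attributes for the project/domain
flytectl get cluster-resource-attribute -p tmp -d development
```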
Also I cannot seem to use this `task_resources` block when running `flytectl demo start --config test-config.yaml`:
```yaml
task_resources:
  defaults:
    cpu: 100m
    memory: 200Mi
    storage: 100M
  limits:
    cpu: 500m
    gpu: 1
    memory: 8Gi
    storage: 10G
```
Gives:
```
❯ flytectl demo start --config /Users/{user}/.flyte/config-sandbox.yaml
Error:
strict mode is on but received keys [map[task_resources:{}]] to decode with no config assigned to receive them: failed strict mode check
ERRO[0000]
```
Success! I followed the instructions here: https://github.com/flyteorg/flyte/pull/3061. I added the following to the `flyte-sandbox-config` ConfigMap and restarted the Flyte Pod:
```yaml
data:
  000-core.yaml: |
    ...
    task_resources:
      defaults:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 2
        memory: 8Gi
        gpu: 5
    cluster_resources:
      customData:
      - production:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: "16Gi"
      - staging:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: "16Gi"
      - development:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: "16Gi"
      ...
    ...
    flyte:
      admin:
        disableClusterResourceManager: true
        ...
```
Perhaps these are also sane defaults for the standard sandbox (when running `flytectl demo start`)?
t
I don't work for Flyte or Union.ai so might be best to ping someone else
j
that `task_resources` config block can go directly into `~/.flyte/sandbox/config.yaml`; then run `flytectl demo reload` and it will be used in subsequent invocations of sandboxes too. the config is passed through and merged with the sandbox's base flyte config.
maybe this is something that needs to be documented better. @David Espejo (he/him) cc
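If I'm reading that right, the flow would be something like this (a sketch; the resource values are just the ones from earlier in the thread):

```shell
# This is *flyte* config merged into the sandbox's base config,
# not flytectl client config
mkdir -p ~/.flyte/sandbox
cat > ~/.flyte/sandbox/config.yaml <<'EOF'
task_resources:
  defaults:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2
    memory: 8Gi
EOF

# Restart the sandbox so it picks up the extra config
flytectl demo reload
```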
s
I encountered the same error as @Geert when attempting to add the `task_resources` block to the existing config.yaml file:
```
strict mode is on but received keys [map[task_resources:{}]] to decode with no config assigned to receive them: failed strict mode check
ERRO[0000]
```
g
Yeah @jeev, editing `~/.flyte/sandbox/config.yaml` doesn't seem to work. @Samhita Alla I used the following, maybe it helps you (it uses https://github.com/mikefarah/yq for YAML processing):
```shell
# get the current configmap and store it locally
kubectl get cm -n flyte flyte-sandbox-config -o=yaml > configmap-flyte-sandbox-config.yaml

# update the configmap with new values from the local file 000-core.yaml
yq eval '.data."000-core.yaml" = "'"$(< ./flyte/000-core.yaml)"'"' configmap-flyte-sandbox-config.yaml > updated-configmap-flyte-sandbox-config.yaml
kubectl -n flyte apply -f updated-configmap-flyte-sandbox-config.yaml

# restart the flyte pods to use the new values
kubectl delete pods -l app.kubernetes.io/name=flyte-sandbox -n flyte

# cleanup
rm configmap-flyte-sandbox-config.yaml
rm updated-configmap-flyte-sandbox-config.yaml
```
My `000-core.yaml` looks like this (the only thing I changed here is increasing the resource (memory/cpu/etc.) limits):
```yaml
admin:
  endpoint: localhost:8089
  insecure: true
catalog-cache:
  endpoint: localhost:8081
  insecure: true
  type: datacatalog
task_resources:
  defaults:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2
    memory: 8Gi
    gpu: 5
cluster_resources:
  customData:
  - production:
    - projectQuotaCpu:
        value: 8
    - projectQuotaMemory:
        value: 16Gi
  - staging:
    - projectQuotaCpu:
        value: 8
    - projectQuotaMemory:
        value: 16Gi
  - development:
    - projectQuotaCpu:
        value: 8
    - projectQuotaMemory:
        value: 16Gi
  standaloneDeployment: false
  templatePath: /etc/flyte/cluster-resource-templates
logger:
  show-source: true
  level: 6
propeller:
  create-flyteworkflow-crd: true
webhook:
  certDir: /var/run/flyte/certs
  localCert: true
  secretName: flyte-sandbox-webhook-secret
  serviceName: flyte-sandbox-webhook
  servicePort: 443
flyte:
  admin:
    disableClusterResourceManager: true
    disableScheduler: false
    disabled: false
    seedProjects:
    - flytesnacks
  dataCatalog:
    disabled: false
  propeller:
    disableWebhook: false
    disabled: false
```
j
it should work. trying to repro
@Geert did you `flytectl demo reload` after?
@Samhita Alla which file were you adding it to?
Setting this config:
```
> cat ~/.flyte/sandbox/config.yaml
task_resources:
  defaults:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2
    memory: 8Gi
    gpu: 5
```
is passed through to the pod:
```
> kubectl exec -it flyte-sandbox-79fc858b47-mj5w9 -- cat /etc/flyte/config.d/999-extra-config.yaml
Defaulted container "flyte" out of: flyte, flyteagent, wait-for-db (init)
task_resources:
  defaults:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2
    memory: 8Gi
    gpu: 5
```
Seems to be working as intended.
you shouldn't need to hack configuration into the configmap.
g
@jeev Yeah, `flytectl demo reload` gives the `strict mode ...` message and doesn't apply. My config is in `~/.flyte/sandbox-config.yaml` (there is no config file in `~/.flyte/sandbox/`, only the `kubeconfig`).
Let me know if I can test anything else 👍 I'm running on macOS at the moment
j
you need to create `~/.flyte/sandbox/config.yaml` and write it to that file. it's not a `flytectl` config, but rather flyte config that's passed through to the pod.
g
Alright, let me try, just a sec
Hmm, I started with the base settings again, and now I don't encounter the OOMKilled issue 🥲
I will try a few times to repro, and do the reload
In general though, it would perhaps be good to have higher default resources for users of the demo environment (which I guess are mostly people trying Flyte out); they shouldn't have to deal with OOMKilled Pods etc., but rather focus on trying a few tasks and workflows.
Thanks for all the help again @jeev, you're the best 👍
j
so the idea was to drop the defaults being injected by flyte if not specified
for now, it probably makes sense to inject some more sane defaults though