# ask-the-community
g
Hi, any tips when my task gets OOMKilled? I'm running with `flytectl demo start`. I've already tried: increasing task resources to `mem=6Gi` (the same code ran fine before with 2Gi, so I wouldn't expect that to be the issue); increasing the VM size (running Docker on macOS); removing and re-initializing the demo environment. I didn't change much since I last ran it without issues, other than turning a workflow into a dynamic workflow and perhaps adding one additional task.
j
can you check what the actual allocation of the task pod is?
t
Might need to check the cluster-wide limits set on the Flyte deployment. Flyte will accept task resource limits/requests over the cluster-wide max, but it won't throw an error message if the cluster-wide limit is 2Gi and the request is 6Gi. It'll just cap provisioning at 2Gi.
@Geert
g
It's a bit tricky to capture since the Pod dies almost instantly, so the metrics server can't fetch it with e.g. `kubectl top pod -n tmp-development`. But node memory usage beforehand is around 20% and I don't see any sudden spike before the task fails. @Tommy Nam I don't see any LimitRange or ResourceQuota applied in the demo environment (only some general resource requests/limits around 100-200Mi, but those are for the Flyte deployment itself, not the task Pods). Where could I find these cluster-wide limits?
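One way to still get at the numbers after the fact: the kubelet records why a container was terminated, so even a short-lived pod can be inspected after it dies (a sketch; `<pod>` is a placeholder for the failed task pod's name):

```shell
# List recent pods in the namespace, including failed ones
kubectl get pods -n tmp-development

# Why the container was terminated ("OOMKilled" confirms the memory kill)
kubectl get pod <pod> -n tmp-development \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# The resources the pod was actually scheduled with
kubectl get pod <pod> -n tmp-development \
  -o jsonpath='{.spec.containers[0].resources}'
```

Comparing that last output against what you requested in the task decorator shows whether a cluster-wide cap silently clamped the allocation.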
Checked out the `FlyteWorkflow` CRD:
```
Kind:  FlyteWorkflow
Execution Config:
  Environment Variables:  <nil>
  Interruptible:          <nil>
  Max Parallelism:        25
  Overwrite Cache:        false
  Recovery Execution:
  Task Plugin Impls:
  Task Resources:
    Limits:
      CPU:                2
      Ephemeral Storage:  0
      GPU:                1
      Memory:             1Gi
      Storage:            0
    Requests:
      CPU:                2
      Ephemeral Storage:  0
      GPU:                0
      Memory:             200Mi
      Storage:            0
Execution Id:
  Domain:   development
  Name:     fc4eefd4fe2614c0987f
  Project:  tmp
```
I'm not sure where the `1Gi` here is set (I think from the default here: https://github.com/flyteorg/flyte/blob/1e3d515550cb338c2edb3919d79c6fa1f0da5a19/charts/flyte-core/values.yaml#L520C4-L531C15). Perhaps I'm also misconfiguring the dynamic task's resources? I have the resources set as follows. Should I configure the `@dynamic` workflow to have 6Gi as well?
```python
@task(limits=Resources(mem="6Gi"))
def run_task():
    # do stuff
    ...


@dynamic(limits=Resources(mem="500Mi"))
def base_workflow(config: Config):
    for i in some_list:  # placeholder iterable
        run_task()


@workflow
def wf(config: Config):
    base_workflow(config=config)
```
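In case it helps narrow things down: flytekit's `Resources` also takes `requests`, and setting both on the task makes the pod's scheduled allocation explicit instead of the request falling back to the platform `task_resources` defaults (a sketch, not your actual task):

```python
from flytekit import Resources, task

# Setting requests *and* limits pins the pod allocation explicitly,
# instead of the request falling back to the platform defaults
# (e.g. the 200Mi request / 1Gi limit visible in the CRD dump).
@task(
    requests=Resources(mem="6Gi", cpu="1"),
    limits=Resources(mem="6Gi", cpu="2"),
)
def run_task() -> None:
    ...  # do stuff
```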
Tried this but no luck:
```
flytectl update cluster-resource-attribute --attrFile cra.yaml
```
with `cra.yaml`:
```yaml
attributes:
    projectQuotaCpu: "1000"
    projectQuotaMemory: 8Gi
domain: development
project: tmp
```
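For checking whether an update like that actually landed, flytectl can read the attributes back (a sketch, using the project/domain from this thread):

```shell
# Verify the applied cluster resource attributes for the project/domain
flytectl get cluster-resource-attribute -p tmp -d development
```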
Also I cannot seem to use this `task_resources` block when running `flytectl demo start --config test-config.yaml`:
```yaml
task_resources:
  defaults:
    cpu: 100m
    memory: 200Mi
    storage: 100M
  limits:
    cpu: 500m
    gpu: 1
    memory: 8Gi
    storage: 10G
```
Gives:
```
❯ flytectl demo start --config /Users/{user}/.flyte/config-sandbox.yaml
Error:
strict mode is on but received keys [map[task_resources:{}]] to decode with no config assigned to receive them: failed strict mode check
ERRO[0000]
```
Success! I followed the instructions here: https://github.com/flyteorg/flyte/pull/3061. I added the following to the `flyte-sandbox-config` ConfigMap and restarted the Flyte Pod:
```yaml
data:
  000-core.yaml: |
    ...
    task_resources:
      defaults:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 2
        memory: 8Gi
        gpu: 5
    cluster_resources:
      customData:
      - production:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: "16Gi"
      - staging:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: "16Gi"
      - development:
        - projectQuotaCpu:
            value: "8"
        - projectQuotaMemory:
            value: "16Gi"
      ...
    ...
    flyte:
      admin:
        disableClusterResourceManager: true
        ...
```
Perhaps these are also sane defaults for the standard sandbox (when running `flytectl demo start`)?
t
I don't work for Flyte or Union.ai so might be best to ping someone else
j
that `task_resources` config block can go directly into `~/.flyte/sandbox/config.yaml`; then run `flytectl demo reload` and it will be used in subsequent invocations of sandboxes too. the config is passed through and merged with the sandbox's base flyte config.
maybe this is something that needs to be documented better. @David Espejo (he/him) cc
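If I'm reading that right, the flow would be something like this (a sketch; the resource values are just the ones from earlier in the thread):

```shell
# This is *flyte* config merged into the sandbox's base config,
# not flytectl client config
mkdir -p ~/.flyte/sandbox
cat > ~/.flyte/sandbox/config.yaml <<'EOF'
task_resources:
  defaults:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2
    memory: 8Gi
EOF

# Restart the sandbox so it picks up the extra config
flytectl demo reload
```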
s
I encountered the same error as @Geert when attempting to add the `task_resources` block to the existing config.yaml file:
```
strict mode is on but received keys [map[task_resources:{}]] to decode with no config assigned to receive them: failed strict mode check
ERRO[0000]
```
g
Yeah @jeev, editing `~/.flyte/sandbox/config.yaml` doesn't seem to work. @Samhita Alla I used the following, maybe it helps you (it uses https://github.com/mikefarah/yq for YAML processing):
```shell
# get the current configmap and store it locally
kubectl get cm -n flyte flyte-sandbox-config -o=yaml > configmap-flyte-sandbox-config.yaml

# update the configmap with new values from the local file 000-core.yaml
yq eval '.data."000-core.yaml" = "'"$(< ./flyte/000-core.yaml)"'"' configmap-flyte-sandbox-config.yaml > updated-configmap-flyte-sandbox-config.yaml
kubectl -n flyte apply -f updated-configmap-flyte-sandbox-config.yaml

# restart the flyte pods to use the new values
kubectl delete pods -l app.kubernetes.io/name=flyte-sandbox -n flyte

# cleanup
rm configmap-flyte-sandbox-config.yaml
rm updated-configmap-flyte-sandbox-config.yaml
```
My `000-core.yaml` looks like this (the only thing I changed here is increasing the resource (memory/cpu/etc.) limits):
```yaml
admin:
  endpoint: localhost:8089
  insecure: true
catalog-cache:
  endpoint: localhost:8081
  insecure: true
  type: datacatalog
task_resources:
  defaults:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2
    memory: 8Gi
    gpu: 5
cluster_resources:
  customData:
  - production:
    - projectQuotaCpu:
        value: 8
    - projectQuotaMemory:
        value: 16Gi
  - staging:
    - projectQuotaCpu:
        value: 8
    - projectQuotaMemory:
        value: 16Gi
  - development:
    - projectQuotaCpu:
        value: 8
    - projectQuotaMemory:
        value: 16Gi
  standaloneDeployment: false
  templatePath: /etc/flyte/cluster-resource-templates
logger:
  show-source: true
  level: 6
propeller:
  create-flyteworkflow-crd: true
webhook:
  certDir: /var/run/flyte/certs
  localCert: true
  secretName: flyte-sandbox-webhook-secret
  serviceName: flyte-sandbox-webhook
  servicePort: 443
flyte:
  admin:
    disableClusterResourceManager: true
    disableScheduler: false
    disabled: false
    seedProjects:
    - flytesnacks
  dataCatalog:
    disabled: false
  propeller:
    disableWebhook: false
    disabled: false
```
j
it should work. trying to repro
@Geert did you `flytectl demo reload` after?
@Samhita Alla which file were you adding it to?
Setting this config:
```
> cat ~/.flyte/sandbox/config.yaml
task_resources:
  defaults:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2
    memory: 8Gi
    gpu: 5
```
is passed through to the pod:
```
> kubectl exec -it flyte-sandbox-79fc858b47-mj5w9 -- cat /etc/flyte/config.d/999-extra-config.yaml
Defaulted container "flyte" out of: flyte, flyteagent, wait-for-db (init)
task_resources:
  defaults:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2
    memory: 8Gi
    gpu: 5
```
Seems to be working as intended.
you shouldn't need to hack configuration into the configmap.
g
@jeev Yeah, `flytectl demo reload` gives the `strict mode ...` message and doesn't apply. My config is in `~/.flyte/sandbox-config.yaml` (there is no config file in `~/.flyte/sandbox/`, only the `kubeconfig`).
Let me know if I can test anything else 👍 I'm running on macOS at the moment
j
you need to create `~/.flyte/sandbox/config.yaml` and write it to that file. it's not a `flytectl` config, but rather flyte config that's passed through to the pod.
g
Alright, let me try, just a sec
Hmm, I started with the base settings again, and now I don't encounter the OOMKilled issue 🥲
I will try a few times to repro, and do the reload
In general though, it would perhaps be good to have higher default resources for users of the demo environment (which I guess are mostly people trying Flyte out); they shouldn't have to deal with OOMKilled Pods etc., but rather focus on trying a few tasks and workflows.
Thanks for all the help again @jeev, you're the best 👍
j
so the idea was to drop the defaults being injected by flyte if not specified
for now, it probably makes sense to inject some more sane defaults though