# flyte-support
c
I'm running into the same OOMKilled issue referenced in this thread when attempting to run a "hello_world" workflow. Here is my workflow script:
Copy code
from flytekit import task, workflow


@task
def say_hello(name: str) -> str:
    return f"Hello, {name}!"


@workflow
def hello_world_wf(name: str = 'world') -> str:
    res = say_hello(name=name)
    return res


if __name__ == "__main__":
    print(f"Running wf() {hello_world_wf(name='passengers')}")
Here is the command I'm running it with:
Copy code
pyflyte --config config.yaml run --remote hello_world.py hello_world_wf --name "test"
The say_hello task fails with: "tar: Removing leading `/' from member names". I tried updating the flyte-binary chart values as mentioned in the referenced thread with:
Copy code
configuration:
  inline:
    task_resources:
      defaults:
        cpu: 1000m
        memory: 1000Mi
        storage: 1000Mi
      limits:
        memory: 2000Mi
But I still receive the OOMKilled error. Is that the correct place to configure the execution deployment?
p
"tar: Removing leading `/' from member names" is a common message and probably not something to be worried about. can you try updating your config.yaml file to match below
Copy code
domain: development
project: flyte
defaults:
  cpu: "1"
  memory: "1Gi"
limits:
  cpu: "1"
  memory: "1Gi"
then you can apply it rather than pass in the argument each time:
flytectl update task-resource-attribute --attrFile config.yaml
c
I get
Copy code
{
  "json": {},
  "level": "error",
  "msg": "error unmarshaling JSON: while decoding JSON: json: unknown field \"admin\"",
  "ts": "2024-02-07T15:12:01-07:00"
}
when I try to run that
but it works when I explicitly pass the config
👍 1
so this is what the config.yaml looks like now:
Copy code
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///localhost:8089
  authType: Pkce
  insecure: true
logger:
  show-source: true
  level: 0
domain: development
project: flyte
defaults:
  cpu: "1"
  memory: "1Gi"
limits:
  cpu: "1"
  memory: "1Gi"
p
cool - not sure why you can't apply it, but passing it along ain't so bad :)
c
I tried that and I get the same OOMKilled error
I'm deploying to an EKS cluster btw, not sure if that's relevant. It hasn't had resource-limitation issues like this before, so it seems something isn't being configured properly
p
if you run
kubectl get pods {podname} -n {namespace} -o yaml
, what do you see in the
resources
section?
c
Copy code
f1356549b38a24b40a3c-n0-0   0/1     OOMKilled   0          6h41m
f17851e152bc5403b8b4-n0-0   0/1     OOMKilled   0          73m
f2707872a787246bc806-n0-0   0/1     OOMKilled   0          4h27m
f7ecf98631fcf4a60b88-n0-0   0/1     OOMKilled   0          68m
f964680826e7049dda77-n0-0   0/1     OOMKilled   0          3h10m
fa086beacc2d94fbfb83-n0-0   0/1     OOMKilled   0          107m
faae043eabec347428f4-n0-0   0/1     OOMKilled   0          3m20s
those are the executions
p
kubectl get pods f1356549b38a24b40a3c-n0-0 -o yaml
or maybe choose a pod that was recently killed
c
Copy code
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    primary_container_name: f1356549b38a24b40a3c-n0-0
  creationTimestamp: "2024-02-07T15:32:34Z"
  labels:
    domain: development
    execution-id: f1356549b38a24b40a3c
    interruptible: "false"
    node-id: n0
    project: flytesnacks
    shard-key: "15"
    task-name: workflows-example-say-hello
    workflow-name: workflows-example-wf
  name: f1356549b38a24b40a3c-n0-0
  namespace: flytesnacks-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: f1356549b38a24b40a3c
    uid: ed8b03d3-3548-4904-88a7-ba53357d995b
  resourceVersion: "45428602"
  uid: a5d43e57-3a6f-437f-bed1-e3a6b3837dc5
spec:
  affinity: {}
  containers:
  - args:
    - pyflyte-fast-execute
    - --additional-distribution
    - s3://titanflow-flyte-metadata/flytesnacks/development/2TYAHUQ3Q2NXEQIIS6IT2RLYS4======/script_mode.tar.gz
    - --dest-dir
    - .
    - --
    - pyflyte-execute
    - --inputs
    - s3://titanflow-flyte-metadata/metadata/propeller/flytesnacks-development-f1356549b38a24b40a3c/n0/data/inputs.pb
    - --output-prefix
    - s3://titanflow-flyte-metadata/metadata/propeller/flytesnacks-development-f1356549b38a24b40a3c/n0/data/0
    - --raw-output-data-prefix
    - s3://titanflow-flyte-userdata/data/zj/f1356549b38a24b40a3c-n0-0
    - --checkpoint-path
    - s3://titanflow-flyte-userdata/data/zj/f1356549b38a24b40a3c-n0-0/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - workflows.example
    - task-name
    - say_hello
    env:
    - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
      value: flytesnacks:development:workflows.example.wf
    - name: FLYTE_INTERNAL_EXECUTION_ID
      value: f1356549b38a24b40a3c
    - name: FLYTE_INTERNAL_EXECUTION_PROJECT
      value: flytesnacks
    - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
      value: development
    - name: FLYTE_ATTEMPT_NUMBER
      value: "0"
    - name: FLYTE_INTERNAL_TASK_PROJECT
      value: flytesnacks
    - name: FLYTE_INTERNAL_TASK_DOMAIN
      value: development
    - name: FLYTE_INTERNAL_TASK_NAME
      value: workflows.example.say_hello
    - name: FLYTE_INTERNAL_TASK_VERSION
      value: Zi8CreB-Mt7L3ki48QSLjg
    - name: FLYTE_INTERNAL_PROJECT
      value: flytesnacks
    - name: FLYTE_INTERNAL_DOMAIN
      value: development
    - name: FLYTE_INTERNAL_NAME
      value: workflows.example.say_hello
    - name: FLYTE_INTERNAL_VERSION
      value: Zi8CreB-Mt7L3ki48QSLjg
    - name: AWS_METADATA_SERVICE_TIMEOUT
      value: "5"
    - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
      value: "20"
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: us-east-1
    - name: AWS_REGION
      value: us-east-1
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::476134988768:role/CustomerManagedBasic-Iceberg-Writer
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: cr.flyte.org/flyteorg/flytekit:py3.10-1.10.3
    imagePullPolicy: IfNotPresent
    name: f1356549b38a24b40a3c-n0-0
    resources:
      limits:
        cpu: "2"
        memory: 200Mi
      requests:
        cpu: "2"
        memory: 200Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-xxjf5
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-80-48-151.ec2.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
  - name: kube-api-access-xxjf5
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-02-07T15:32:34Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-02-07T15:32:40Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-02-07T15:32:40Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-02-07T15:32:34Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://1eecc22aaff2d2a6baeeee33454962f0f12ae8cf804367a7353b93b091da0335
    image: cr.flyte.org/flyteorg/flytekit:py3.10-1.10.3
    imageID: cr.flyte.org/flyteorg/flytekit@sha256:749fc0ab76071ab5f3d207e1e06071a170233a346cfc7b7fe5d4652655621a33
    lastState: {}
    name: f1356549b38a24b40a3c-n0-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://1eecc22aaff2d2a6baeeee33454962f0f12ae8cf804367a7353b93b091da0335
        exitCode: 137
        finishedAt: "2024-02-07T15:32:40Z"
        message: |
          tar: Removing leading `/' from member names
        reason: OOMKilled
        startedAt: "2024-02-07T15:32:37Z"
  hostIP: 2001:55a:2000:3601:7d80:89ad:6702:27eb
  phase: Failed
  podIP: 2001:55a:2000:3601:a9f5::c
  podIPs:
  - ip: 2001:55a:2000:3601:a9f5::c
  qosClass: Guaranteed
  startTime: "2024-02-07T15:32:34Z"
p
note the resources section doesn't have the config applied:
Copy code
resources:
      limits:
        cpu: "2"
        memory: 200Mi
      requests:
        cpu: "2"
        memory: 200Mi
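as an aside, a quick stdlib sketch (not from this thread, names are illustrative) makes the gap concrete: the pod's 200Mi limit is roughly 210 MB, a fifth of the 1Gi you asked for

```python
import re

# Binary (Ki/Mi/Gi) and decimal (k/M/G) suffixes used by Kubernetes memory quantities.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
          "k": 10**3, "M": 10**6, "G": 10**9, "": 1}

def k8s_quantity_to_bytes(q: str) -> int:
    """Convert a Kubernetes memory quantity like '200Mi' or '1Gi' to bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", q.strip())
    if m is None or m.group(2) not in _UNITS:
        raise ValueError(f"unrecognized quantity: {q!r}")
    return int(float(m.group(1)) * _UNITS[m.group(2)])

# The pod's actual limit vs. the intended default from this thread:
print(k8s_quantity_to_bytes("200Mi"))  # 209715200
print(k8s_quantity_to_bytes("1Gi"))    # 1073741824
```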
c
I see I see
p
do you need
authType: Pkce
?
if you're not using auth at the moment, that can be removed
c
ok cool I removed it
Should that not be enough to run that basic say_hello python task?
p
i know i had problems with the default resource settings running a basic task (as have others)
you should also be able to cp the config to
~/.flyte/config.yaml
- that's the default location flytekit looks for the config. then you won't need to pass the location as an argument
c
ok I'll place it there
p
lmk if you get another OOM result. we can try patching the config explicitly using a secondary config
c
moved the config and submitted, still OOM
how does the second config work?
p
create a
taskconfig.yaml
file in your task project dir
then run
flytectl update task-resource-attribute --attrFile taskconfig.yaml
the contents of file should look like:
Copy code
domain: development
project: flyte-az
defaults:
  cpu: "1"
  memory: "1Gi"
limits:
  cpu: "1"
  memory: "1Gi"
where your project and domain are updated to reflect your local env
c
when I run
flytectl update task-resource-attribute --attrFile taskconfig.yaml
I get
Copy code
Error: 
strict mode is on but received keys [map[defaults:{} limits:{}]] to decode with no config assigned to receive them: failed strict mode check
ERRO[0000] 
strict mode is on but received keys [map[defaults:{} limits:{}]] to decode with no config assigned to receive them: failed strict mode check  src="main.go:13"
Copy code
Updating the task resource attribute is only available from a generated file. See the get section for generating this file.
p
huh - which version of flytectl are you using?
another option is to update your flyte-binary helm chart. in values.yaml, find the
k8s
section. then add the following block:
Copy code
k8s:
    plugins:
      default-cpus: 500m
      default-memory: 512Mi
c
Copy code
brew info flytectl
==> flyteorg/tap/flytectl: stable 0.8.10
p
yup
c
I don't see a section for
k8s
p
ah, sry. you can place it under
configmap:
or, as David said in that post, you can add it to the inline section:
Copy code
configuration:
  inline:
    task_resources:
      defaults:
        cpu: 100m
        memory: 100Mi
        storage: 100Mi
      limits:
        memory: 2Gi
c
I'll give it another try and check the pod description
👍 1
I got a green!
awesome, thank you for the assist!
👍 1
p
sorry that took a few different angles to get straightened out
a
Thank you @proud-answer-87162 for going above and beyond! @creamy-energy-49433 let us know if any other issue arises. I feel a proper resource/doc/blog on profiling/resource planning and K8s limits/resources is needed
👍 2
🙌 1