Hi all 👋
I have a problem sharing data between tasks. I found a similar issue here in the discussions (link):
```
Workflow[flyte-anti-fraud-ml:development:app.workflow.main_flow] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: error file @[s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-f31c365f02c114639b00/n0/data/0/error.pb] is too large [28775519] bytes, max allowed [10485760] bytes
```
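To double-check that I'm reading the error right, here is the reported file size versus the default limit (both numbers taken straight from the message above):

```python
# Numbers taken from the error message above
error_file_bytes = 28_775_519     # size of error.pb reported by propeller
default_limit_bytes = 10_485_760  # the "max allowed" value, i.e. 10 MiB

print(error_file_bytes / 1024 / 1024)          # ~27.4 MiB
print(error_file_bytes > default_limit_bytes)  # True
```

So the error file is roughly 27.4 MiB, well over the 10 MiB default.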
I added the `max-output-size-bytes` param to the `flyte-propeller-config` and waited for all changes to apply before re-submitting a new task:

```shell
kubectl edit configmap -n flyte flyte-propeller-config
```

The propeller section of my `flyte-propeller-config` looks like:
```yaml
core.yaml: |
  manager:
    pod-application: flytepropeller
    pod-template-container-name: flytepropeller
    pod-template-name: flytepropeller-template
  propeller:
    max-output-size-bytes: 52428800
    downstream-eval-duration: 30s
    enable-admin-launcher: true
    leader-election:
      enabled: true
      lease-duration: 15s
      lock-config-map:
        name: propeller-leader
        namespace: flyte
      renew-deadline: 10s
      retry-period: 2s
    limit-namespace: all
    max-workflow-retries: 3
    metadata-prefix: metadata/propeller
    metrics-prefix: flyte
    prof-port: 10254
```
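For reference, the value I set is 50 MiB, which should comfortably cover the ~27 MiB error file:

```python
# 52428800 bytes is 50 MiB, well above the 28775519-byte error.pb
new_limit_bytes = 50 * 1024 * 1024
print(new_limit_bytes)               # 52428800
print(new_limit_bytes > 28_775_519)  # True
```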
Storage and task-resource configuration have been set up via `kubectl -n flyte edit cm flyte-admin-base-config`:
```yaml
storage.yaml: |
  storage:
    type: minio
    container: "my-s3-bucket"
    stow:
      kind: s3
      config:
        access_key_id: minio
        auth_type: accesskey
        secret_key: miniostorage
        disable_ssl: true
        endpoint: http://minio.flyte.svc.cluster.local:9000
        region: us-east-1
    signedUrl:
      stowConfigOverride:
        endpoint: http://localhost:30084
    enable-multicontainer: false
    limits:
      maxDownloadMBs: 50
task_resource_defaults.yaml: |
  task_resources:
    defaults:
      cpu: 1
      memory: 3000Mi
      storage: 100Mi
    limits:
      cpu: 4
      gpu: 1
      memory: 3Gi
      storage: 500Mi
```
Changing `maxDownloadMBs` also didn't change the situation. Changing the cache `max_size_mbs` in `flyte-propeller-config` from 0 to a custom value didn't work either:
```yaml
cache.yaml: |
  cache:
    max_size_mbs: 100
    target_gc_percent: 70
```
I tried several times with different params, but the error came up on every new execution.
I saw that neither `max-output-size-bytes` nor `max-workflow-retries` (changed from 30 to 3) is being picked up by the workflow execution:

```
RuntimeExecutionError: max number of system retry attempts [31/30] exhausted...
error file @[s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-f31c365f02c114639b00/n0/data/0/error.pb] is too large [28775519] bytes, max allowed [10485760] bytes...
```
Here are my CLI steps to create a new execution:
```shell
kubectl -n flyte edit cm flyte-admin-base-config
kubectl edit configmap -n flyte flyte-propeller-config
flytectl get task-resource-attribute -p flyteexamples -d development
flytectl update project -p flyte-anti-fraud-ml -d development --storage.cache.max_size_mbs 100
flytectl get launchplan --project flyte-anti-fraud-ml --domain development app.workflow.main_flow --latest --execFile exec_spec.yaml
flytectl create execution --project flyte-anti-fraud-ml --domain development --execFile exec_spec.yaml
```
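One thing I'm not sure about: whether flytepropeller re-reads its configmap while running. If it only loads config at startup, I assume the pod would need a restart after the edit, something like the following (the deployment name `flytepropeller` is an assumption from my install, not verified):

```shell
# Restart propeller so it re-reads the edited configmap
# (deployment and namespace names assumed from a standard Flyte install)
kubectl -n flyte rollout restart deployment flytepropeller
kubectl -n flyte rollout status deployment flytepropeller
```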
What additional steps do I have to take to force Flyte to use my propeller changes and solve the problem of the 10 MB max size allowed for serialized uploads?