able-pizza-19502
09/28/2022, 8:31 AM
Workflow[flyte-anti-fraud-ml:development:app.workflow.main_flow] failed. RuntimeExecutionError: max number of system retry attempts [31/30] exhausted. Last known status message: failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: error file @[s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-f31c365f02c114639b00/n0/data/0/error.pb] is too large [28775519] bytes, max allowed [10485760] bytes
I added the `max-output-size-bytes` parameter to `flyte-propeller-config` and waited for all changes to apply before re-submitting a new task.
kubectl edit configmap -n flyte flyte-propeller-config
The propeller section of my `flyte-propeller-config` looks like this:
core.yaml: |
  manager:
    pod-application: flytepropeller
    pod-template-container-name: flytepropeller
    pod-template-name: flytepropeller-template
  propeller:
    max-output-size-bytes: 52428800
    downstream-eval-duration: 30s
    enable-admin-launcher: true
    leader-election:
      enabled: true
      lease-duration: 15s
      lock-config-map:
        name: propeller-leader
        namespace: flyte
      renew-deadline: 10s
      retry-period: 2s
    limit-namespace: all
    max-workflow-retries: 3
    metadata-prefix: metadata/propeller
    metrics-prefix: flyte
    prof-port: 10254
Task configuration has been set up via kubectl -n flyte edit cm flyte-admin-base-config:
storage.yaml: |
  storage:
    type: minio
    container: "my-s3-bucket"
    stow:
      kind: s3
      config:
        access_key_id: minio
        auth_type: accesskey
        secret_key: miniostorage
        disable_ssl: true
        endpoint: http://minio.flyte.svc.cluster.local:9000
        region: us-east-1
    signedUrl:
      stowConfigOverride:
        endpoint: http://localhost:30084
    enable-multicontainer: false
    limits:
      maxDownloadMBs: 50
task_resource_defaults.yaml: |
  task_resources:
    defaults:
      cpu: 1
      memory: 3000Mi
      storage: 100Mi
    limits:
      cpu: 4
      gpu: 1
      memory: 3Gi
      storage: 500Mi
Changing `maxDownloadMBs` also didn't change the situation.
Changing the cache `max_size_mbs` in `flyte-propeller-config` from 0 to a custom value did not work either:
cache.yaml: |
  cache:
    max_size_mbs: 100
    target_gc_percent: 70
I tried changing different parameters several times, but the error came back on every new execution.
I saw that neither `max-output-size-bytes` nor `max-workflow-retries` (changed from 30 to 3) is passed to the workflow execution:
RuntimeExecutionError: max number of system retry attempts [31/30] exhausted...
error file @[s3://my-s3-bucket/metadata/propeller/flyte-anti-fraud-ml-development-f31c365f02c114639b00/n0/data/0/error.pb] is too large [28775519] bytes, max allowed [10485760] bytes...
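The limits printed in the error can be compared against the values set in the ConfigMap. The sketch below is only an illustration: it assumes, as the message above does, that the second number in `[31/30]` is the `max-workflow-retries` value in effect and that the "max allowed" size corresponds to `max-output-size-bytes`. The helper names `limits_in_effect` and `stale_settings` are made up for this example.

```python
import re

# Values set in flyte-propeller-config, taken from the thread.
CONFIGURED = {"max_output_size_bytes": 52428800, "max_workflow_retries": 3}

# Error text copied from the failed execution.
ERROR_TEXT = (
    "RuntimeExecutionError: max number of system retry attempts [31/30] exhausted... "
    "error file @[s3://my-s3-bucket/metadata/propeller/"
    "flyte-anti-fraud-ml-development-f31c365f02c114639b00/n0/data/0/error.pb] "
    "is too large [28775519] bytes, max allowed [10485760] bytes"
)


def limits_in_effect(error_text: str) -> dict:
    """Extract the limits that the running propeller reported in its error."""
    retries = re.search(r"retry attempts \[(\d+)/(\d+)\]", error_text)
    size = re.search(
        r"too large \[(\d+)\] bytes, max allowed \[(\d+)\] bytes", error_text
    )
    return {
        # Assumption: the denominator is the retry limit currently in effect.
        "max_workflow_retries": int(retries.group(2)),
        "error_file_bytes": int(size.group(1)),
        "max_output_size_bytes": int(size.group(2)),
    }


def stale_settings(configured: dict, in_effect: dict) -> list:
    """Names of settings whose configured value is not the one in effect."""
    return [name for name, value in configured.items() if in_effect.get(name) != value]


effective = limits_in_effect(ERROR_TEXT)
# Settings that the running process has apparently not picked up yet.
print(stale_settings(CONFIGURED, effective))
# Whether the configured limit would be large enough for this file.
print(effective["error_file_bytes"] <= CONFIGURED["max_output_size_bytes"])
```

If both settings show up as stale, the likely reading is that the running propeller process still uses the old configuration, not that the new values are too small.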
Here are my CLI steps to create a new execution:
- kubectl -n flyte edit cm flyte-admin-base-config
- kubectl edit configmap -n flyte flyte-propeller-config
- flytectl get task-resource-attribute -p flyteexamples -d development
- flytectl update project -p flyte-anti-fraud-ml -d development --storage.cache.max_size_mbs 100
- flytectl get launchplan --project flyte-anti-fraud-ml --domain development app.workflow.main_flow --latest --execFile exec_spec.yaml
- flytectl create execution --project flyte-anti-fraud-ml --domain development --execFile exec_spec.yaml
What additional steps do I have to take to make flytectl pick up my propeller changes and get past the 10 MB maximum size allowed for serialized uploads to Flyte?
tall-lock-23197
able-pizza-19502
09/28/2022, 10:00 AM
kubectl edit configmap -n flyte flyte-propeller-config
I have to wait 5 minutes for a healthy upstream in the Flyte UI console (I thought it restarted automatically after new config changes were submitted).
How can I forcibly restart flyte-propeller via the kubectl CLI?
able-pizza-19502
09/28/2022, 10:03 AM
serialize --> register
.PHONY: serialize
serialize:
	echo ${CURDIR}
	pyflyte -c flyte.config --pkgs app package \
		--force \
		--in-container-source-path /root \
		--image ${FULL_IMAGE_NAME}:${VERSION}

.PHONY: register
# register: docker-push serialize
register: serialize
	flytectl -c ${FLYTECTL_CONFIG} \
		register files \
		--project ${PROJECT} \
		--domain development \
		--archive flyte-package.tgz \
		--force \
		--version ${VERSION}
Also, according to the docs, some flags are deprecated in favour of using a config file via `flyte admin`:
--sourceUploadPath string Deprecated: Update flyte admin to avoid having to configure storage access from flytectl.
My problem is that one task produces dataclass instances as its output, and another task should take these instances as input parameters:
@workflow
def main_flow() -> Forecast:
    """
    Main Flyte WorkFlow consisting of three tasks:
    - @preproc_and_split
    - @train_xgboost_clf
    - @get_predictions
    """
    logger.info(log="#START -- START Raw Preprocessing and Splitting", timestamp=None)
    train_cls, target_cls = preproc_and_split()
    logger.info(log="#START -- START Initialize Boosting Params", timestamp=None)
    saved_mpath = train_xgboost_clf(
        feat_cls=train_cls,
        target_cls=target_cls,
        xgb_params=xgb_params,
        cust_metric=BoostingCustMetric
    )
Where
def preproc_and_split() -> Tuple[Fraud_Raw_PostProc_Data_Class, Fraud_Raw_Target_Data_Class]:
These dataclasses are produced only after task execution, so I cannot register them as proto files, because they are promised return values of my function.
The 10 MB error arises when these dataclasses are shared between tasks.
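To get a rough feel for how large such a dataclass instance is before handing it to the next task, its JSON-encoded size can be measured locally. This is only a sketch under assumptions: `ExampleFeatures` is a hypothetical stand-in for `Fraud_Raw_PostProc_Data_Class`, and the JSON size is just a proxy, since the bytes Flyte actually writes may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

MAX_ALLOWED_BYTES = 10485760  # limit quoted in the error message


@dataclass
class ExampleFeatures:
    """Hypothetical stand-in for Fraud_Raw_PostProc_Data_Class."""

    rows: List[List[float]] = field(default_factory=list)


def json_size_bytes(obj) -> int:
    """Size of a dataclass instance when encoded as UTF-8 JSON."""
    return len(json.dumps(asdict(obj)).encode("utf-8"))


# Two illustrative payloads: a small one and one with 250,000 rows.
small = ExampleFeatures(rows=[[0.0] * 10 for _ in range(100)])
large = ExampleFeatures(rows=[[0.0] * 10 for _ in range(250_000)])

for name, obj in (("small", small), ("large", large)):
    size = json_size_bytes(obj)
    verdict = "over limit" if size > MAX_ALLOWED_BYTES else "within limit"
    print(name, size, verdict)
```

A check like this only tells whether the payload is in the neighbourhood of the limit; it does not replace raising the limit in the propeller configuration.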
I'm wondering how to apply the changes in `flyte-propeller-config` so that I can get past these serialization limits.
tall-lock-23197
kubectl -n flyte rollout restart deploy flytepropeller is the restart command.
able-pizza-19502
09/28/2022, 10:06 AM
able-pizza-19502
09/28/2022, 11:05 AM
kubectl -n flyte rollout restart deploy flytepropeller
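The restart can also be scripted so that it waits for the new pod before an execution is re-submitted. This is a minimal sketch, assuming `kubectl` is on the PATH and already points at the right cluster. The function names and the 120s timeout are choices made for this example, not something from the thread.

```python
import subprocess
from typing import Callable, List

NAMESPACE = "flyte"
DEPLOYMENT = "flytepropeller"


def restart_command(namespace: str = NAMESPACE, deployment: str = DEPLOYMENT) -> List[str]:
    """The restart command quoted in the thread, as an argument list."""
    return ["kubectl", "-n", namespace, "rollout", "restart", "deploy", deployment]


def status_command(
    namespace: str = NAMESPACE, deployment: str = DEPLOYMENT, timeout: str = "120s"
) -> List[str]:
    """Command that blocks until the rollout finishes or the timeout expires."""
    return [
        "kubectl", "-n", namespace, "rollout", "status", "deploy", deployment,
        f"--timeout={timeout}",
    ]


def restart_and_wait(run: Callable = subprocess.run) -> None:
    """Restart propeller, then wait for the rollout to complete."""
    for cmd in (restart_command(), status_command()):
        run(cmd, check=True)  # raises CalledProcessError if kubectl fails


# On a machine with cluster access, call:
# restart_and_wait()
```

Passing the runner in as a parameter keeps the sketch testable without a cluster; in normal use the default `subprocess.run` is enough.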
able-pizza-19502
09/28/2022, 11:06 AM
able-pizza-19502
09/28/2022, 11:06 AM