nice-market-38632
08/23/2024, 10:46 AM
"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: output file @[<>/n0/data/0/outputs.pb] is too large [19664776] bytes, max allowed [10485760] bytes]. Error Type[*errors.NodeErrorWithCause]"
But the workflow is stuck in the running state on the Flyte console.
This issue looks similar, but it doesn’t seem to have been resolved:
https://github.com/flyteorg/flyte/issues/381
high-accountant-32689
08/23/2024, 4:39 PM
nice-market-38632
08/23/2024, 4:39 PM
high-accountant-32689
08/23/2024, 4:41 PM
nice-market-38632
08/23/2024, 5:23 PM
nice-market-38632
08/23/2024, 5:25 PM
high-accountant-32689
08/23/2024, 5:27 PM
nice-market-38632
08/23/2024, 5:30 PM
thankful-minister-83577
nice-market-38632
08/25/2024, 10:57 AM
from flytekit import task, workflow, ImageSpec

# Container image shared by both tasks.
normal_image = ImageSpec(
    base_image="python:3.9-slim",
    packages=["flytekit==1.10.3"],
    registry="ttl.sh",
    name="skdjbKBJ1341-normal",
    source_root="..",
)


@task(container_image=normal_image)
def print_arrays(arr1: str) -> None:
    print(f"Array 1: {arr1}")


@task(container_image=normal_image)
def increase_size_of_of_arrays(n: int) -> str:
    # Build a string of n KiB so the task output grows with n.
    arr1 = 'a' * n * 1024
    return arr1


# Workflow: Orchestrate the tasks
@workflow
def simple_pipeline(n: int) -> int:
    arr1 = increase_size_of_of_arrays(n=n)
    print_arrays(arr1=arr1)
    return 2


# Runs the pipeline locally
if __name__ == "__main__":
    result = simple_pipeline(n=5)
I also just verified this in the Flyte sandbox.
To reproduce, register the file above:
pyflyte --pkgs limit_eg package -f --source .
flytectl register files --project flytesnacks --domain development --archive flyte-package.tgz --version 1
Then run it from the UI with n=11000 (the I/O size will be about 11 MB).
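As a rough sanity check on that size (a sketch, not part of the original repro): the task returns 'a' * n * 1024, so the raw string is n KiB, and the 10485760-byte limit quoted in the error above is 10 MiB. The names `MAX_OUTPUT_BYTES` and `payload_bytes` are made up for illustration.

```python
KIB = 1024
# "max allowed" value copied from the propeller error message (10 MiB).
MAX_OUTPUT_BYTES = 10_485_760


def payload_bytes(n: int) -> int:
    """Size in bytes of the ASCII string 'a' * n * 1024 built by the task."""
    return len(("a" * n * KIB).encode("utf-8"))


size = payload_bytes(11000)
print(f"raw string size: {size} bytes")
print(f"exceeds limit:   {size > MAX_OUTPUT_BYTES}")
# Largest n whose raw string alone still fits under the limit.
print(f"n at the limit:  {MAX_OUTPUT_BYTES // KIB}")
```

The serialized outputs.pb is slightly larger than the raw string, so the largest n that actually fits is presumably a little below the value printed on the last line.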
The workflow is now stuck in the running state forever.
The only trace of the failure is in the propeller logs:
[failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: output file @[s3://my-s3-bucket/metadata/propeller/flytesnacks-development-at9h4tkhqx5fjbp6sbfm/n0/data/0/outputs.pb] is too large [11264029] bytes, max allowed [10485760] bytes]. Error Type[*errors.NodeErrorWithCause]","ts":"2024-08-25T10:53:09Z"}
E0825 10:53:09.978923 1 workers.go:103] error syncing 'flytesnacks-development/at9h4tkhqx5fjbp6sbfm': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: output file @[s3://my-s3-bucket/metadata/propeller/flytesnacks-development-at9h4tkhqx5fjbp6sbfm/n0/data/0/outputs.pb] is too large [11264029] bytes, max allowed [10485760] bytes
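For anyone scanning propeller logs for this failure, here is a minimal sketch that pulls the two sizes out of a line like the ones above. The regular expression is written against the message text as it appears in this thread only, and `parse_size_error` is a hypothetical helper, not a Flyte API. The sample line has its path shortened.

```python
import re
from typing import Optional, Tuple

# Matches the size portion of the error text quoted in this thread.
SIZE_ERROR = re.compile(r"is too large \[(\d+)\] bytes, max allowed \[(\d+)\] bytes")


def parse_size_error(line: str) -> Optional[Tuple[int, int]]:
    """Return (actual_bytes, max_allowed_bytes), or None if the line does not match."""
    match = SIZE_ERROR.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


log_line = (
    "failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, "
    "caused by: output file @[.../n0/data/0/outputs.pb] is too large "
    "[11264029] bytes, max allowed [10485760] bytes"
)

parsed = parse_size_error(log_line)
if parsed is not None:
    actual, limit = parsed
    print(f"over the limit by {actual - limit} bytes")
    # The raw string for n=11000 is 11000 * 1024 bytes; the rest is serialization overhead.
    print(f"overhead on top of the raw string: {actual - 11000 * 1024} bytes")
```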
These errors are not propagated as events to flyteadmin, so the execution stays stuck forever with no error message on the Flyte console.
nice-market-38632
08/25/2024, 11:49 AM
nice-market-38632
08/26/2024, 8:25 AM
thankful-minister-83577
flat-area-42876
08/27/2024, 12:47 AM
k describe flyteworkflow -n <...> <execution_id>
thankful-minister-83577
nice-market-38632
08/27/2024, 6:26 AM
nice-market-38632
08/27/2024, 6:30 AM
apiVersion: flyte.lyft.com/v1alpha1
kind: FlyteWorkflow
metadata:
  creationTimestamp: '2024-08-27T06:24:56Z'
  finalizers:
    - flyte-finalizer
  generation: 12
  labels:
    domain: development
    execution-id: akmmktbbq7m987bc8f2n
    project: flytesnacks
    shard-key: '26'
    workflow-name: limit-eg-test-simple-pipeline
  name: akmmktbbq7m987bc8f2n
  namespace: flytesnacks-development
  resourceVersion: '1165624'
  uid: 69194738-ea50-48a7-b6db-23e0ad69ace7
  selfLink: >-
    /apis/flyte.lyft.com/v1alpha1/namespaces/flytesnacks-development/flyteworkflows/akmmktbbq7m987bc8f2n
status:
  dataDir: >-
    s3://my-s3-bucket/metadata/propeller/flytesnacks-development-akmmktbbq7m987bc8f2n
  defVersion: 1
  failedAttempts: 4
  lastUpdatedAt: '2024-08-27T06:25:01Z'
  message: >-
    failed at Node[n0]. RuntimeExecutionError: failed during plugin execution,
    caused by: output file
    @[s3://my-s3-bucket/metadata/propeller/flytesnacks-development-akmmktbbq7m987bc8f2n/n0/data/0/outputs.pb]
    is too large [11264029] bytes, max allowed [10485760] bytes
  nodeStatus:
    n0:
      TaskNodeStatus:
        pState: >-
          S38DAQELUGx1Z2luU3RhdGUB/4AAAQMBBVBoYXNlAQYAAQ5LOHNQbHVnaW5TdGF0ZQH/ggABD0xhc3RFdmVudFVwZGF0ZQH/hAAAAD//gQMBAQtQbHVnaW5TdGF0ZQH/ggABAwEFUGhhc2UBBAABDFBoYXNlVmVyc2lvbgEGAAEGUmVhc29uAQwAAAAQ/4MFAQEEVGltZQH/hAAAAAv/gAECAQEKAQEAAA==
        phase: 5
        phaseVersion: 1
        psv: 1
        updAt: '2024-08-27T06:25:07.939094379Z'
      dynamicNodeStatus: {}
      laStartedAt: '2024-08-27T06:25:01Z'
      lastUpdatedAt: '2024-08-27T06:25:01Z'
      message: running
      phase: 2
      queuedAt: '2024-08-27T06:25:01Z'
      startedAt: '2024-08-27T06:25:01Z'
    start-node:
      phase: 5
      stoppedAt: '2024-08-27T06:25:01Z'
  phase: 1
  startedAt: '2024-08-27T06:25:01Z'
spec:
  connections:
    n0:
      - n1
    n1:
      - end-node
    start-node:
      - n0
  edges:
    downstream:
      n0:
        - n1
      n1:
        - end-node
      start-node:
        - n0
    upstream:
      end-node:
        - n1
      n0:
        - start-node
      n1:
        - n0
  id: flytesnacks:development:limit_eg.test.simple_pipeline
  nodes:
    end-node:
      id: end-node
      inputBindings:
        - binding:
            scalar:
              primitive:
                integer: '2'
          var: o0
      kind: end
      resources: {}
    n0:
      id: n0
      inputBindings:
        - binding:
            promise:
              nodeId: start-node
              var: 'n'
          var: 'n'
      kind: task
      name: increase_size_of_of_arrays
      resources: {}
      task: >-
        resource_type:TASK project:"flytesnacks" domain:"development"
        name:"limit_eg.test.increase_size_of_of_arrays" version:"3"
    n1:
      id: n1
      inputBindings:
        - binding:
            promise:
              nodeId: n0
              var: o0
          var: arr1
      kind: task
      name: print_arrays
      resources: {}
      task: >-
        resource_type:TASK project:"flytesnacks" domain:"development"
        name:"limit_eg.test.print_arrays" version:"3"
    start-node:
      id: start-node
      kind: start
      resources: {}
  outputBindings:
    - binding:
        scalar:
          primitive:
            integer: '2'
      var: o0
  outputs:
    variables:
      o0:
        type:
          simple: INTEGER
acceptedAt: '2024-08-27T06:24:56Z'
executionConfig:
  EnvironmentVariables: null
  Interruptible: null
  MaxParallelism: 25
  OverwriteCache: false
  RecoveryExecution: {}
  TaskPluginImpls: {}
  TaskResources:
    Limits:
      CPU: '2'
      EphemeralStorage: '0'
      GPU: '5'
      Memory: 4Gi
      Storage: '0'
    Requests:
      CPU: 500m
      EphemeralStorage: '0'
      GPU: '0'
      Memory: 1Gi
      Storage: '0'
executionId:
  domain: development
  name: akmmktbbq7m987bc8f2n
  project: flytesnacks
inputs:
  literals:
    'n':
      scalar:
        primitive:
          integer: '11000'
node-defaults: {}
rawOutputDataConfig: {}
securityContext:
  run_as: {}
tasks:
  resource_type:TASK project:"flytesnacks" domain:"development" name:"limit_eg.test.increase_size_of_of_arrays" version:"3":
    container:
      args:
        - pyflyte-execute
        - '--inputs'
        - '{{.input}}'
        - '--output-prefix'
        - '{{.outputPrefix}}'
        - '--raw-output-data-prefix'
        - '{{.rawOutputDataPrefix}}'
        - '--checkpoint-path'
        - '{{.checkpointOutputPrefix}}'
        - '--prev-checkpoint'
        - '{{.prevCheckpointPrefix}}'
        - '--resolver'
        - flytekit.core.python_auto_container.default_task_resolver
        - '--'
        - task-module
        - limit_eg.test
        - task-name
        - increase_size_of_of_arrays
      image: ttl.sh/skdjbkbj1341-normal:ahPJ_Pe5dK7cEx5eLvclfA
      resources:
        limits:
          - name: CPU
            value: 500m
          - name: MEMORY
            value: 1Gi
        requests:
          - name: CPU
            value: 500m
          - name: MEMORY
            value: 1Gi
    id:
      domain: development
      name: limit_eg.test.increase_size_of_of_arrays
      project: flytesnacks
      resourceType: TASK
      version: '3'
    interface:
      inputs:
        variables:
          'n':
            type:
              simple: INTEGER
      outputs:
        variables:
          o0:
            type:
              simple: STRING
    metadata:
      retries: {}
      runtime:
        flavor: python
        type: FLYTE_SDK
        version: 1.13.4
    type: python-task
  resource_type:TASK project:"flytesnacks" domain:"development" name:"limit_eg.test.print_arrays" version:"3":
    container:
      args:
        - pyflyte-execute
        - '--inputs'
        - '{{.input}}'
        - '--output-prefix'
        - '{{.outputPrefix}}'
        - '--raw-output-data-prefix'
        - '{{.rawOutputDataPrefix}}'
        - '--checkpoint-path'
        - '{{.checkpointOutputPrefix}}'
        - '--prev-checkpoint'
        - '{{.prevCheckpointPrefix}}'
        - '--resolver'
        - flytekit.core.python_auto_container.default_task_resolver
        - '--'
        - task-module
        - limit_eg.test
        - task-name
        - print_arrays
      image: ttl.sh/skdjbkbj1341-normal:ahPJ_Pe5dK7cEx5eLvclfA
      resources:
        limits:
          - name: CPU
            value: 500m
          - name: MEMORY
            value: 1Gi
        requests:
          - name: CPU
            value: 500m
          - name: MEMORY
            value: 1Gi
    id:
      domain: development
      name: limit_eg.test.print_arrays
      project: flytesnacks
      resourceType: TASK
      version: '3'
    interface:
      inputs:
        variables:
          arr1:
            type:
              simple: STRING
      outputs: {}
    metadata:
      retries: {}
      runtime:
        flavor: python
        type: FLYTE_SDK
        version: 1.13.4
    type: python-task
workflowMeta:
  eventVersion: 2
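The fields that matter for this discussion sit under status, so a small helper can make them easier to compare across executions. This is only a sketch: `summarize_status` is a hypothetical function, the sample dict is trimmed by hand from the CRD above, and in practice the dict would presumably come from something like `kubectl get flyteworkflow -n <namespace> <execution_id> -o json`. Phase values are reported as the raw integers from the CRD, with no attempt to name them.

```python
import json


def summarize_status(crd: dict) -> dict:
    """Collect the retry count, phases and failure message from a FlyteWorkflow dict."""
    status = crd.get("status", {})
    node_status = status.get("nodeStatus", {})
    return {
        "failed_attempts": status.get("failedAttempts", 0),
        "workflow_phase": status.get("phase"),
        "message": status.get("message", ""),
        # Raw per-node phase integers, keyed by node id.
        "node_phases": {node_id: ns.get("phase") for node_id, ns in node_status.items()},
    }


# Trimmed copy of the status block shown in the CRD above.
sample_crd = {
    "status": {
        "failedAttempts": 4,
        "phase": 1,
        "message": (
            "failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, "
            "caused by: output file @[.../n0/data/0/outputs.pb] is too large "
            "[11264029] bytes, max allowed [10485760] bytes"
        ),
        "nodeStatus": {
            "n0": {"phase": 2, "message": "running"},
            "start-node": {"phase": 5},
        },
    }
}

summary = summarize_status(sample_crd)
print(json.dumps(summary, indent=2))
```

In this CRD the workflow-level message already carries the size error while node n0 still reports "running", which matches what the console shows.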
Here is the CRD for the same execution.
nice-market-38632
08/27/2024, 6:30 AM
flat-area-42876
08/27/2024, 6:35 AM
flat-area-42876
08/27/2024, 6:36 AM
nice-market-38632
08/27/2024, 6:41 AM
nice-market-38632
08/27/2024, 6:46 AM
propeller:
  downstream-eval-duration: 30s
  enable-admin-launcher: true
  leader-election:
    enabled: true
    lease-duration: 15s
    lock-config-map:
      name: propeller-leader
      namespace: flyte
    renew-deadline: 10s
    retry-period: 2s
  limit-namespace: all
  max-workflow-retries: 30
  metadata-prefix: metadata/propeller
  metrics-prefix: flyte
  prof-port: 10254
  queue:
    batch-size: -1
    batching-interval: 2s
    queue:
      base-delay: 5s
      capacity: 1000
      max-delay: 120s
      rate: 100
      type: maxof
    sub-queue:
      capacity: 100
      rate: 10
      type: bucket
    type: batch
  rawoutput-prefix: s3://my-s3-bucket/
  workers: 4
  workflow-reeval-duration: 30s
This is what I found in flyte-propeller-config.
thankful-minister-83577
failedAttempts: 4
field in it… do you have one that’s higher? like near the limit of 50? (just so we can look at the timestamps)
thankful-minister-83577
flytectl demo
environment? i’ve been testing on our live backend, not in the sandbox environment
thankful-minister-83577
shard-key
in the crd? or does that just get added all the time?
flat-area-42876
08/27/2024, 5:29 PM
nice-market-38632
08/28/2024, 5:10 AM
nice-market-38632
08/28/2024, 5:27 AM
"do you have one that’s higher? like near the limit of 50?"
Where can I check this?
nice-market-38632
08/28/2024, 6:16 AM
"i’ve been testing on our live backend"
Can you share the config of your live backend, please?
nice-market-38632
08/28/2024, 6:50 AM
Changing max-workflow-retries: 30 to 5 reduces the time taken to mark the workflow as failed, but the failure should be reported immediately.
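A back-of-the-envelope sketch of why lowering max-workflow-retries shortens the wait. It assumes (nothing in this thread confirms it) that every failed reconcile round is requeued with an exponential backoff that starts at base-delay (5s) and is capped at max-delay (120s), and that the workflow is only marked failed once max-workflow-retries rounds are used up. It ignores workflow-reeval-duration and the bucket rate limiter, so the numbers are illustrative only. The function names are hypothetical.

```python
from typing import List


def backoff_delays(retries: int, base_delay: float = 5.0, max_delay: float = 120.0) -> List[float]:
    """Assumed per-round delays: base_delay doubled each round, capped at max_delay."""
    return [min(base_delay * 2 ** attempt, max_delay) for attempt in range(retries)]


def estimated_seconds_until_failed(retries: int, base_delay: float = 5.0, max_delay: float = 120.0) -> float:
    """Total assumed backoff time before the retry budget is exhausted."""
    return sum(backoff_delays(retries, base_delay, max_delay))


for retries in (5, 30):
    seconds = estimated_seconds_until_failed(retries)
    print(f"max-workflow-retries={retries}: ~{seconds:.0f}s (~{seconds / 60:.1f} min)")
```

Under these assumptions the wait shrinks from roughly an hour to a few minutes, but it never reaches zero, which is consistent with the observation that the failure is still not reported immediately.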