Hello I ran into an issue where a task failed with a very la Flyte #flyte-support

abundant-laptop-47033

10/31/2023, 6:45 PM

Hello! I ran into an issue where a task failed with a very large traceback, and the FlytePropeller is unable to update the workflow to Failed, so it is just stuck in Running. The log I'm seeing in FlytePropeller is:

Failed to update workflow. Error [rpc error: code = ResourceExhausted desc = trying to send message larger than max (2159575 vs. 2097152)]

Can this max value be adjusted in helm? I'm struggling to find exactly where to change it. I'm using FlytePropeller v1.1.47. Thanks!
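
For reference, the two numbers in that log line can be checked with a few lines of Python. This is only arithmetic on the values quoted above; it shows that the limit is exactly 2 MiB and how far over it the message was. It says nothing about where the limit is configured.

```python
# Values copied from the FlytePropeller log line quoted above.
limit_bytes = 2097152     # the "max" in the error
message_bytes = 2159575   # the size of the message that was rejected

# The limit is exactly 2 MiB.
is_two_mib = limit_bytes == 2 * 1024 * 1024

# How far over the limit the rejected message was.
overage_bytes = message_bytes - limit_bytes
overage_percent = overage_bytes / limit_bytes * 100

print(f"limit is 2 MiB: {is_two_mib}")
print(f"over the limit by {overage_bytes} bytes ({overage_percent:.1f}%)")
```

The message was only about 3% over the limit, so a modest reduction in the size of the failing update would have been enough in this case.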

freezing-airport-6809

10/31/2023, 8:43 PM

yes it can be

freezing-airport-6809

10/31/2023, 8:43 PM

I thought by default it truncates

freezing-airport-6809

10/31/2023, 8:44 PM

also you can turn on spec offloading

abundant-laptop-47033

10/31/2023, 9:00 PM

@freezing-airport-6809 thanks! could you please point me to any documentation or more information on how to configure the maximum value? can you also provide some more information on spec offloading?

tall-lock-23197

11/01/2023, 5:25 AM

cc @hallowed-mouse-14616

thankful-minister-83577

11/01/2023, 10:03 AM

spec offloading option can be set under applicationconfiguration: https://github.com/flyteorg/flyte/blob/1b92105d0750da88414962237e530a16e573f81c/flyteadmin/pkg/runtime/interfaces/application_configuration.go#L96

abundant-laptop-47033

11/01/2023, 3:38 PM

thank you! yes, I was able to find this from reading the documentation. I'm still stuck on figuring out how to just increase the max message size limit though. that's the most straightforward way of solving our immediate problem and I'd like to implement that before exploring other strategies. would someone be able to point me to where in the config that can be set?

average-finland-92144

11/01/2023, 4:31 PM

@abundant-laptop-47033 what Helm chart are you using (flyte-binary or flyte-core)? I see this flag for the task executor and I'm not sure if that's what you need: maxLogMessageLength (int). I'm concerned about the default value, and while it shows "deprecated", I still see it in the code.

abundant-laptop-47033

11/01/2023, 4:35 PM

Using the flyte-core helm chart. I found an option for configuring the max message size in FlyteAdmin but I can't find a corresponding option for FlytePropeller. I can try adjusting maxLogMessageLength in the meantime.

full-ram-17934

11/01/2023, 9:16 PM

Digging more into what @freezing-airport-6809 said, I am wondering why the workflow update failed, given that I also see the logic to truncate error messages. I am also wondering if spec offloading is actually relevant, since I thought that was about relieving pressure on etcd. So to confirm:

• Anytime we see a complaint about max message size, that's related to gRPC communications with flyteadmin, correct?
• What does the request flow look like for propeller updating a workflow? Does it double write to etcd (assuming offloading is not turned on) and admin, or do all state updates flow through admin?
  ◦ What's the best way to debug this and understand what's driving the message size error?

I am trying to get a better understanding of how this works by reading through the code, but I think I am missing some insights.

abundant-laptop-47033

11/02/2023, 8:50 PM

We've been able to reproduce a minimal example of this behavior using flyte milestone release 1.6.2 & flytekit 1.10.0:

from flytekit import workflow, task


@task
def raise_giant_error() -> None:
    error_line = "This is a really big error"
    n_lines = 100000
    raise ValueError("\n".join([error_line] * n_lines))


@workflow
def giant_error_wf() -> None:
    raise_giant_error()

This results in the task being marked as FAILED, showing the full giant error in the UI, but the workflow remains stuck in RUNNING and the FlytePropeller logs show:

Failed to update workflow. Error [rpc error: code = ResourceExhausted desc = trying to send message larger than max (2805993 vs. 2097152)]

This is not solved by adjusting maxMessageSizeBytes or UseOffloadedWorkflowClosure. Please let me know if there is more information I can provide!
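
To see why this reproduction crosses the limit, the size of the exception message alone can be computed without Flyte. The sketch below rebuilds the same string and compares it to the 2097152-byte max from the log line. It also includes `truncate_message`, a hypothetical helper (not part of flytekit or FlytePropeller) that caps a message before it is raised. Whether capping the message inside the task is enough to unblock the workflow update has not been verified here, so treat it as an assumption to test rather than a fix.

```python
# The max reported in the FlytePropeller log line (2 MiB).
GRPC_LIMIT_BYTES = 2 * 1024 * 1024


def build_giant_message(error_line: str = "This is a really big error",
                        n_lines: int = 100000) -> str:
    """Rebuild the same message the reproduction task raises."""
    return "\n".join([error_line] * n_lines)


def truncate_message(message: str, max_bytes: int = 64 * 1024) -> str:
    """Hypothetical helper: cap a message at max_bytes of UTF-8.

    The 64 KiB default is an arbitrary choice for illustration.
    """
    encoded = message.encode("utf-8")
    if len(encoded) <= max_bytes:
        return message
    suffix = "\n... [truncated]"
    keep = max_bytes - len(suffix.encode("utf-8"))
    # errors="ignore" drops a multi-byte character cut in half at the boundary.
    head = encoded[:keep].decode("utf-8", errors="ignore")
    return head + suffix


message = build_giant_message()
message_bytes = len(message.encode("utf-8"))
print(f"message alone: {message_bytes} bytes, limit: {GRPC_LIMIT_BYTES} bytes")
print(f"exceeds limit: {message_bytes > GRPC_LIMIT_BYTES}")

short = truncate_message(message)
print(f"after truncation: {len(short.encode('utf-8'))} bytes")
```

The message by itself is already larger than the limit, which is consistent with the error text accounting for most of the 2805993-byte update that was rejected. In a task, the helper would be used as `raise ValueError(truncate_message(big_text))`.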

freezing-airport-6809

11/06/2023, 5:30 AM

@full-ram-17934 - spec offloading will relieve pressure on etcd, as the spec will also not be stored in etcd. It's not a silver bullet, but it will help. cc @hallowed-mouse-14616 FYI

hallowed-mouse-14616

11/06/2023, 3:05 PM

@abundant-laptop-47033 do you mind filing an issue? This is certainly a bug.

abundant-laptop-47033

11/06/2023, 4:34 PM

I will file an issue now. Thanks for looking into it! We did try the spec offloading but it didn't relieve the message size enough to resolve the error.

abundant-laptop-47033

11/06/2023, 4:42 PM

https://github.com/flyteorg/flyte/issues/4371
