# ask-the-community
Hello! I ran into an issue where a task failed with a very large traceback, and the FlytePropeller is unable to update the workflow to Failed, so it is just stuck in Running. The log I'm seeing in FlytePropeller is:
Failed to update workflow. Error [rpc error: code = ResourceExhausted desc = trying to send message larger than max (2159575 vs. 2097152)]
Can this max value be adjusted in helm? I'm struggling to find exactly where to change it. I'm using FlytePropeller v1.1.47. Thanks!
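For anyone reading along, the two numbers in that log line are the attempted message size and the limit, both in bytes. Here is a minimal sketch that pulls them out and shows how far over the limit the update was. The helper name `parse_size_error` is made up for illustration and is not part of Flyte or gRPC; the regex only assumes the message format quoted above.

```python
import re
from typing import Tuple

# The error text exactly as quoted from the FlytePropeller log above.
GRPC_ERROR = (
    "rpc error: code = ResourceExhausted desc = "
    "trying to send message larger than max (2159575 vs. 2097152)"
)


def parse_size_error(message: str) -> Tuple[int, int]:
    """Return (attempted_bytes, limit_bytes) from a size error string.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"larger than max \((\d+) vs\. (\d+)\)", message)
    if match is None:
        raise ValueError("not a message-size error")
    return int(match.group(1)), int(match.group(2))


attempted, limit = parse_size_error(GRPC_ERROR)
overage = attempted - limit

print(f"limit:     {limit} bytes ({limit / 2**20:.0f} MiB)")
print(f"attempted: {attempted} bytes ({overage} bytes over the limit)")
```

So the limit in play is 2 MiB (2097152 bytes), and this particular update was about 62 KB over it.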
yes it can be
i thought by default it truncates
also you can turn on spec offloading
@Ketan (kumare3) thanks! could you please point me to any documentation or more information on how to configure the maximum value? can you also provide some more information on spec offloading?
cc @Dan Rammer (hamersaw)
thank you! yes, I was able to find this from reading the documentation. I'm still stuck on figuring out how to just increase the max message size limit though. that's the most straightforward way of solving our immediate problem and I'd like to implement that before exploring other strategies. would someone be able to point me to where in the config that can be set?
@Anna Cunningham what Helm chart are you using? I see this flag for the task executor and I'm not sure if that's what you need:
maxLogMessageLength (int)
I'm concerned about the default value and while it shows "deprecated", I see it still in the code
helm chart. I found an option for configuring the max message size in FlyteAdmin but I can't find a corresponding option for FlytePropeller. I can try adjusting
in the meantime.
Digging into what @Ketan (kumare3) said more, I am wondering why the workflow update failed given that I also see the logic to truncate error messages. I am also wondering if spec offloading is actually relevant, since I thought that was about relieving pressure on etcd. So to confirm:
• anytime we see a complaint about max message size, that's related to gRPC communications with flyteadmin, correct?
• what does the request flow look like for propeller updating a workflow? Does it double write to etcd (assuming offloading is not turned on) and admin, or do all state updates flow through admin?
  ◦ what's the best way to debug this and understand what's driving the message size error?
I am trying to get a better understanding of how this works by reading through the code, but I am missing some insights I think.
We've been able to reproduce a minimal example of this behavior using flyte milestone release 1.6.2 & flytekit 1.10.0:
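from flytekit import task, workflow


@task
def raise_giant_error() -> None:
    # Raise an exception whose message is 100000 repeated lines (about 2.7 MB).
    error_line = "This is a really big error"
    n_lines = 100000
    raise ValueError("\n".join([error_line] * n_lines))


@workflow
def giant_error_wf() -> None:
    raise_giant_error()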
this results in the task being marked as FAILED showing the full giant error in the UI, but the workflow remains stuck in RUNNING and the FlytePropeller logs show
Failed to update workflow. Error [rpc error: code = ResourceExhausted desc = trying to send message larger than max (2805993 vs. 2097152)]
. This is not solved by adjusting the
. please let me know if there is more information I can provide!
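To put a number on the repro, here is a quick check of how big that exception message is by itself, compared with the limit from the log line. It uses only the values from the example above; nothing here calls into Flyte.

```python
# Same values as the repro above.
error_line = "This is a really big error"
n_lines = 100000
message = "\n".join([error_line] * n_lines)

# Size of the message text once encoded, versus the limit quoted in the log.
payload_bytes = len(message.encode("utf-8"))
limit_bytes = 2097152      # limit from the FlytePropeller log
reported_bytes = 2805993   # message size reported in the same log line

print(f"exception message: {payload_bytes} bytes")
print(f"over the limit on its own: {payload_bytes > limit_bytes}")
print(f"rest of the update: {reported_bytes - payload_bytes} bytes")
```

The message text alone is already past 2 MiB, so the error string accounts for nearly all of the oversized update. What makes up the remaining bytes (traceback frames, other fields in the request) is a guess on my part and not something the log confirms.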
@Matthew Corley - spec offloading will relieve pressure on etcd, as the spec will also not be stored in etcd. It's not a silver bullet, but it will help. cc @Dan Rammer (hamersaw) FYI
@Anna Cunningham do you mind filing an issue? This is certainly a bug.
I will file an issue now. Thanks for looking into it! We did try the spec offloading but it didn't relieve the message size enough to resolve the error.
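Until the bug is fixed, one possible stopgap on the user side is to cap the size of exception messages before they leave the task. The sketch below is plain Python and is not a Flyte feature: `truncate_errors` and `MAX_ERROR_CHARS` are hypothetical names, and whether this gets the workflow update under the limit on a real cluster is untested.

```python
import functools

MAX_ERROR_CHARS = 10_000  # hypothetical cap, not a Flyte setting


def truncate_errors(fn):
    """Re-raise exceptions from fn with the message cut to MAX_ERROR_CHARS.

    Assumes the exception type can be rebuilt from a single string
    argument, which holds for built-ins such as ValueError but not for
    every custom exception class.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            text = str(exc)
            if len(text) <= MAX_ERROR_CHARS:
                raise  # small enough, leave it untouched
            dropped = len(text) - MAX_ERROR_CHARS
            short = f"{text[:MAX_ERROR_CHARS]}... [truncated {dropped} chars]"
            # "from None" keeps the original giant message out of the
            # chained traceback.
            raise type(exc)(short) from None
    return wrapper


@truncate_errors
def raise_giant_error() -> None:
    error_line = "This is a really big error"
    raise ValueError("\n".join([error_line] * 100000))


try:
    raise_giant_error()
except ValueError as err:
    print(len(str(err)))
```

In a real project the decorator would presumably sit underneath `@task`, but I have not verified how flytekit handles a wrapped function, so treat that placement as an assumption.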