# ask-the-community

Mick Jermsurawong

06/27/2023, 6:32 PM
Are folks using Flyte to do hyperparameter optimization at the scale of 1.4k tasks in a single workflow? We are hitting a "request entity too large" error from K8s, which Flyte Propeller OSS did make a patch for explicitly.

Dan Rammer (hamersaw)

06/27/2023, 6:58 PM
Hey @Mick Jermsurawong! Can you say a little more about the workflow structure? We introduced a flyteadmin flag to offload static portions of the workflow to the blobstore. Previously these were stored in etcd as part of the FlyteWorkflow CR and could inflate the CR size. Unfortunately, this flag doesn't help much with dynamic tasks or large-fanout maptasks, but at only 1.4k tasks this doesn't seem like a very large workflow, comparatively.

Mick Jermsurawong

06/27/2023, 8:42 PM
hi @Dan Rammer (hamersaw), yup, it is a dynamic task spinning up a large number of tasks:
```python
from typing import Dict

from flytekit import dynamic, task


@task(cache=True, cache_serialize=True, cache_version="v5")
def sql_from_file(p1, p2, p3, p4):  # full signature elided in this snippet
    ...


@dynamic
def evaluate(params1, params2, params3, params4) -> Dict:
    ...
    # fan out one cached task per combination of parameters
    for p1 in params1:
        for p2 in params2:
            for p3 in params3:
                for p4 in params4:
                    res = sql_from_file(p1, p2, p3, p4)
```
just to confirm, the size of the inputs should not affect the size of the CRD state here, right? i.e. if my `paramsX` is a large list. We hit the proto size limit before when we had large inputs (e.g. a large dict/list), but I believe that's a different problem.
(this is the error we see; in the log lines we see `etcdserver: request is too large`)
ah ok, I didn't realize we can inspect the underlying CRD. Our user's CRD is already gone, but I will check again to see how it looks under the hood.
```
kubectl describe flyteworkflows.flyte.lyft.com <workflow name> -n <namespace>
```
> 1.4k it doesn't seem like a very large workflow
Thanks Dan for checking! I did inspect the CRD, and the problem is that there are many failing tasks, and the CRD records `Message: Traceback ...` with large stack traces (we delegate to a JVM process run on the Flyte pod). For a small example of 100 tasks, our CRD spans 14k lines.
I think we have our resolution!
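A quick way to confirm that the error messages dominate the CR size is to pull the CR as JSON and total up every `message` string. A minimal sketch, assuming `kubectl` access to the cluster; the workflow name and namespace are placeholders:
```python
import json
import subprocess


def message_bytes(obj) -> int:
    """Recursively sum the size of every string stored under a 'message' key."""
    if isinstance(obj, dict):
        return sum(
            len(v) if k == "message" and isinstance(v, str) else message_bytes(v)
            for k, v in obj.items()
        )
    if isinstance(obj, list):
        return sum(message_bytes(item) for item in obj)
    return 0


# <workflow-name> and <namespace> are placeholders.
raw = subprocess.check_output(
    ["kubectl", "get", "flyteworkflows.flyte.lyft.com", "<workflow-name>",
     "-n", "<namespace>", "-o", "json"]
)
cr = json.loads(raw)
print(f"total CR size:       {len(raw):>10} bytes")
print(f"error message bytes: {message_bytes(cr):>10} bytes")
```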

Dan Rammer (hamersaw)

06/28/2023, 4:08 PM
Yeah, so etcd has a limit on CR sizes - it's typically around 1.5MB but can change with the k8s deployment. In the CR, you mentioned the `Message` field is very large - does this look like the correct field? There is a maxSize of 1024 being set on that; I just want to make sure it's being applied correctly. Would it help if that maxSize were configurable? I suspect that dropping it to 256 would help a lot here.
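For a rough sense of why the cap matters, here is an illustrative back-of-envelope calculation (the ~10 KB per-traceback size is an assumption, and the actual etcd limit varies by deployment):
```python
ETCD_REQUEST_LIMIT = 1.5 * 1024 * 1024  # ~1.5 MB default, varies by deployment
TASKS = 1400                            # fanout of the dynamic workflow

# worst case: every task fails and stores an error message of `cap` bytes in the CR
for cap in (10 * 1024, 1024, 256):      # assumed raw traceback (~10 KB), current 1024 cap, proposed 256
    total = TASKS * cap
    status = "over" if total > ETCD_REQUEST_LIMIT else "under"
    print(f"cap {cap:>6} B -> ~{total / 1024 / 1024:5.2f} MB of error text ({status} the etcd limit)")
```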

Mick Jermsurawong

06/28/2023, 4:47 PM
ah, it is this `Message` field under `Error`:
```
dn119:
  Task Node Status:
    P State:  ...
    Phase:    8
    Psv:      1
    Upd At:   2023-06-23T16:45:44.229049302Z
  Dynamic Node Status:
  Error:
    Code:     USER:Unknown
    Kind:     USER
    Message:  Traceback (most recent call last):

      File "/app/src/python/flyte/project_balance/balance_backtest/py_balance_backtest.binary.runfiles/third_party
      ...

      Failed to execute Spark job.
      Using JVM launcher from spark_runner.sh script...
```
I think capturing the whole error message makes sense to me. We can definitely wrap the error we see and do our own processing/truncation, instead of dropping it at the platform level.
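A rough sketch of what that user-side truncation could look like: catch the failure from the JVM launcher and re-raise with only the tail of its output, so the message recorded in the CR stays small. The `spark_runner.sh` invocation and the 1 KB cap are illustrative assumptions:
```python
import subprocess

MAX_ERROR_CHARS = 1024  # illustrative cap on how much of the failure output we keep


def run_spark_job(*args: str) -> None:
    """Launch the JVM job and surface only a truncated error message on failure."""
    proc = subprocess.run(
        ["./spark_runner.sh", *args],  # launcher script name taken from the log above
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        tail = proc.stderr[-MAX_ERROR_CHARS:]
        # Raising with a short message keeps Error.Message in the FlyteWorkflow CR
        # small instead of embedding the full JVM stack trace.
        raise RuntimeError(
            f"Failed to execute Spark job; last {MAX_ERROR_CHARS} chars of stderr:\n{tail}"
        )
```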

Dan Rammer (hamersaw)

06/28/2023, 5:44 PM
Ok, something to keep an eye on. If this is an issue that persists, we'll need to truncate. I know it's something we do on the NodeStatus Message (as linked) and in many other places - including when reporting this `Error` field in the events (here). So the only way to currently get the full message is to view it in the CR; however, I do believe the max value in the event is 100kb, so it is still quite large.