# ask-the-community

Mick Jermsurawong

06/27/2023, 6:32 PM
Are folks using Flyte to do hyperparameter optimization at the scale of 1.4k tasks in a single workflow? We are hitting a "request entity too large" error from K8s, which Flyte Propeller OSS did make a patch for explicitly.

Dan Rammer (hamersaw)

06/27/2023, 6:58 PM
Hey @Mick Jermsurawong! Can you say a little more about the workflow structure? We introduced a flyteadmin flag to offload static portions of the workflow to the blobstore. Previously these were stored in etcd as part of the FlyteWorkflow CR and could inflate the CR size. Unfortunately, this flag doesn't help much with dynamic tasks or large-fanout maptasks, but at only 1.4k tasks this doesn't seem like a very large workflow, comparatively.

Mick Jermsurawong

06/27/2023, 8:42 PM
hi @Dan Rammer (hamersaw), yup, it is a dynamic task spinning up a large number of tasks:
```python
from typing import Dict

from flytekit import dynamic, task


@task(cache=True, cache_serialize=True, cache_version="v5")
def sql_from_file(p1, p2, p3, p4):  # full signature elided in this snippet
    ...


@dynamic
def evaluate(params1, params2, params3, params4) -> Dict:
    ...
    # fan out one cached task per combination of parameters
    for p1 in params1:
        for p2 in params2:
            for p3 in params3:
                for p4 in params4:
                    res = sql_from_file(p1, p2, p3, p4)
```
just to confirm, the size of the inputs should not affect the size of the CRD state here, right? i.e. if my `paramsX` is a large list. We hit the proto size limit before when we had large inputs (e.g. a large dict/list), but I believe that's a different problem.
(this is the error we see; in the log lines we see `etcdserver: request is too large`)
ah ok, I didn't realize we can inspect the underlying CRD. Our user's CRD is already gone, but I will check again to see how it looks under the hood.
```
kubectl describe flyteworkflows.flyte.lyft.com <workflow name> -n <namespace>
```
> 1.4k it doesn't seem like a very large workflow
Thanks Dan for checking! I did inspect the CRD, and the problem is that there are many failing tasks, and the CRD records `Message: Traceback ...` with large stack traces (we delegate to a JVM process run on the Flyte pod). For a small example of 100 tasks, our CRD spans 14k lines.
I think we have our resolution!
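A quick way to confirm that the error messages dominate the CR size is to pull the CR as JSON and total up every `message` string. A minimal sketch, assuming `kubectl` access to the cluster; the workflow name and namespace are placeholders:
```python
import json
import subprocess


def message_bytes(obj) -> int:
    """Recursively sum the size of every string stored under a 'message' key."""
    if isinstance(obj, dict):
        return sum(
            len(v) if k == "message" and isinstance(v, str) else message_bytes(v)
            for k, v in obj.items()
        )
    if isinstance(obj, list):
        return sum(message_bytes(item) for item in obj)
    return 0


# <workflow-name> and <namespace> are placeholders.
raw = subprocess.check_output(
    ["kubectl", "get", "flyteworkflows.flyte.lyft.com", "<workflow-name>",
     "-n", "<namespace>", "-o", "json"]
)
cr = json.loads(raw)
print(f"total CR size:       {len(raw):>10} bytes")
print(f"error message bytes: {message_bytes(cr):>10} bytes")
```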

Dan Rammer (hamersaw)

06/28/2023, 4:08 PM
Yeah, so etcd has a limit on CR sizes - it's typically around 1.5MB but can change with the k8s deployment. In the CR, you mentioned the `Message` field is very large - does this look like the correct field? There is a maxSize of 1024 being set on that; I just want to make sure it's being applied correctly. Would it help if that maxSize were configurable? I suspect that dropping it to 256 would help a lot here.
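For a rough sense of why the cap matters, here is an illustrative back-of-envelope calculation (the ~10 KB per-traceback size is an assumption, and the actual etcd limit varies by deployment):
```python
ETCD_REQUEST_LIMIT = 1.5 * 1024 * 1024  # ~1.5 MB default, varies by deployment
TASKS = 1400                            # fanout of the dynamic workflow

# worst case: every task fails and stores an error message of `cap` bytes in the CR
for cap in (10 * 1024, 1024, 256):      # assumed raw traceback (~10 KB), current 1024 cap, proposed 256
    total = TASKS * cap
    status = "over" if total > ETCD_REQUEST_LIMIT else "under"
    print(f"cap {cap:>6} B -> ~{total / 1024 / 1024:5.2f} MB of error text ({status} the etcd limit)")
```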

Mick Jermsurawong

06/28/2023, 4:47 PM
ah, it is this `Message` field under `Error`:
```
dn119:
  Task Node Status:
    P State:  ...
    Phase:    8
    Psv:      1
    Upd At:   2023-06-23T16:45:44.229049302Z
  Dynamic Node Status:
  Error:
    Code:     USER:Unknown
    Kind:     USER
    Message:  Traceback (most recent call last):

      File "/app/src/python/flyte/project_balance/balance_backtest/py_balance_backtest.binary.runfiles/third_party
      ...

      Failed to execute Spark job.
      Using JVM launcher from spark_runner.sh script...
```
I think capturing the whole error message makes sense to me. We can definitely wrap the error we see and do our own processing/truncation, instead of dropping it at the platform level.
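A rough sketch of what that user-side truncation could look like: catch the failure from the JVM launcher and re-raise with only the tail of its output, so the message recorded in the CR stays small. The `spark_runner.sh` invocation and the 1 KB cap are illustrative assumptions:
```python
import subprocess

MAX_ERROR_CHARS = 1024  # illustrative cap on how much of the failure output we keep


def run_spark_job(*args: str) -> None:
    """Launch the JVM job and surface only a truncated error message on failure."""
    proc = subprocess.run(
        ["./spark_runner.sh", *args],  # launcher script name taken from the log above
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        tail = proc.stderr[-MAX_ERROR_CHARS:]
        # Raising with a short message keeps Error.Message in the FlyteWorkflow CR
        # small instead of embedding the full JVM stack trace.
        raise RuntimeError(
            f"Failed to execute Spark job; last {MAX_ERROR_CHARS} chars of stderr:\n{tail}"
        )
```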

Dan Rammer (hamersaw)

06/28/2023, 5:44 PM
Ok, something to keep an eye on. If this is an issue that persists, we'll need to truncate. I know it's something we do on the NodeStatus Message (as linked) and in many other places - including when reporting this `Error` field in the events (here). So the only way to currently get the full message is to view it in the CR; however, I do believe the max value in the event is 100kb, so it is still quite large.