# ask-the-community
hey quick question -- when our flyte pods start up, there is about ~1 minute between it starting to unpack the workflow to our task starting:
```
2023-02-28T15:52:39-05:00 tar: Removing leading `/' from member names
2023-02-28T15:53:31-05:00 2023-02-28 20:53:31.706 | INFO STARTED TASK
```
this is ubiquitous across all of our tasks, and we're running a rather deep DAG. locally, in serial (no --remote), it runs blazing fast (a couple of minutes); with the ~1 min stall on each pod, however, it takes > 30 minutes. we'd rather not recombine tasks together if we don't need to, so is there something we're missing here?
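For reference, the stall can be measured directly from the two timestamps in the log snippet above (a quick plain-Python sketch, nothing Flyte-specific):

```python
from datetime import datetime

# The two pod log timestamps quoted above: tar unpack vs. task start.
t_unpack = datetime.fromisoformat("2023-02-28T15:52:39-05:00")
t_start = datetime.fromisoformat("2023-02-28T15:53:31-05:00")

stall = (t_start - t_unpack).total_seconds()
print(stall)  # 52.0 -- roughly the ~1 minute startup stall per pod
```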
Are you fast registering your workflow?
I am also facing a similar situation
am using `pyflyte run`, so I believe so?
@justin hallquist I am registering using pyflyte register and then executing from the console
ya, cli from our side to be clear. same issue on both k3s linux and docker k8s mac, so it doesn't seem to be dependent on the hardware either
just to rule it out -- bumped the resources on a test task from 1500 MB mem to 5500 MB; still a ~1 minute delay
Can you try registering your workflow instead of fast registering it?
`pyflyte run` and `pyflyte register` resort to fast registration. https://docs.flyte.org/projects/cookbook/en/latest/getting_started/package_register.html#productionizing-your-workflows IMO, this should help reduce the time because the code will be present in the docker image and won't be pulled from s3.
@justin hallquist would you be able to install a very verbose version of flytekit? like off a branch. really not sure what could be happening here
because this is a local k8s dev env for the team, we want to avoid having to package, build a new image, push, etc., so as not to slow iterations down, and would rather focus on finding the root of the issue (pyflyte run is just super convenient). we managed to get specific jobs to break that 1 min delay wall. what we saw happening:
• the helm chart has a default max limit on task mem of 1Gi
• annotating the task with just the mem did not cause any errors; it just did not apply the larger amount (when i wrote earlier that i bumped to 5.5g and saw the same delay, i investigated further and noticed the issue) -- pods started up with the helm limits applied
• after addressing that by increasing the limit, our first task, which runs without parallelization, finished in ~12s (rather than 1-2 min)
• however, when running a lot of parallel tasks at the same time, the minute delay came back (we did bump the project resource max up to ensure that wouldn't bottleneck)
at the moment it's still unsolved, but i haven't had too much time the last day or so to investigate further
got it okay. yeah i get that. i was resorting to adding print logic to try to narrow down where exactly the delay is coming from, but if it’s likely memory constrained that sounds good. we can look into those issues - you’re using the helm chart?
and you’re saying that if you do that, it does not get applied?
one sec ill get ya the lines referenced
```yaml
# -- Task default resources parameters
task_resources:
  defaults:
    cpu: 100m
    memory: 200Mi
    storage: 5Mi
  limits:
    cpu: 1
    memory: 1Gi
    storage: 20Mi
    gpu: 1
```
was set as above:
specified via
was not applied to the pod. the definition in the console showed the correct value, but the pod itself had the helm limit. that makes sense because those are task limits, but since there was no error stating we were going above the limit, it became a fool's errand
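What seems to be happening here, sketched as a toy model (illustrative Python, not Flyte's actual code): the platform-level task limit from the helm chart silently caps whatever memory a task requests above it.

```python
# Toy sketch, not Flyte source: the platform task limit quietly caps
# any task memory request that exceeds it.
PLATFORM_LIMIT_MI = 1024  # helm chart's task limit: memory: 1Gi

def effective_memory(requested_mi, platform_limit_mi=PLATFORM_LIMIT_MI):
    # No error is raised when the request exceeds the limit;
    # the pod just starts with the smaller value.
    return min(requested_mi, platform_limit_mi)

print(effective_memory(5500))  # 1024 -- the 5500 MB request is silently capped
print(effective_memory(200))   # 200  -- requests under the limit pass through
```

This matches the symptom above: the console shows the requested value, the pod gets the helm limit, and nothing errors out.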
though resources seem to have made a difference for the first couple of tasks once unlocked after the helm change, i'm still confused, as the bottleneck doesn't seem to be the task's execution: the time to run two tasks in parallel is the same as running many times more than that. in the pic attached, all of those tasks had the 1 minute delay, but they also all completed reasonably quickly after the actual task started. if it were a resource constraint, i would expect only a few tasks to run at a time, or the small set of truth swaps to run drastically faster than the training (i.e. not get that delay; 4 of those tasks run in seconds locally). however, they were all effectively the same duration
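To make that reasoning concrete: if memory really were the bottleneck, parallel tasks would run in waves, and total wall time would scale with how many tasks fit on the cluster at once. A toy model with hypothetical numbers (none of these figures come from the thread):

```python
import math

# Toy model, not Flyte code: tasks run in "waves" when only a few
# fit in cluster memory at a time.
def wall_time(n_tasks, task_s, mem_per_task_gi, cluster_mem_gi):
    # How many tasks fit simultaneously, given their memory footprint.
    concurrent = max(1, cluster_mem_gi // mem_per_task_gi)
    waves = math.ceil(n_tasks / concurrent)
    return waves * task_s

# Hypothetical: 8 tasks of 12s each, 5.5Gi apiece, on 16Gi of cluster memory
# -> only 2 fit at once, so 4 waves: staggered completions, not uniform ones.
print(wall_time(8, 12, 5.5, 16))  # 48
print(wall_time(2, 12, 5.5, 16))  # 12
```

The observed behavior (every task taking effectively the same duration, delay included) is the opposite of this staggering, which supports the conclusion that the 1 minute stall isn't a memory constraint.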