# ask-the-community
hey quick question -- when our flyte pods start up, there is about ~1 minute between it starting to unpack the workflow to our task starting:
```
2023-02-28T15:52:39-05:00 tar: Removing leading `/' from member names
2023-02-28T15:53:31-05:00 2023-02-28 20:53:31.706 | INFO STARTED TASK
```
this is ubiquitous across all of our tasks, and we're running a rather deep DAG. locally, in serial (no --remote), it runs blazing fast (a couple of minutes); with the ~1 min stall on each pod, however, it takes > 30 minutes. we'd rather not recombine tasks together if we don't need to, so is there something we're missing here?
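For reference, the stall can be measured directly from the two timestamps in the log snippet above (a quick plain-Python sketch, nothing Flyte-specific):

```python
from datetime import datetime

# The two pod log timestamps quoted above: tar unpack vs. task start.
t_unpack = datetime.fromisoformat("2023-02-28T15:52:39-05:00")
t_start = datetime.fromisoformat("2023-02-28T15:53:31-05:00")

stall = (t_start - t_unpack).total_seconds()
print(stall)  # 52.0 -- roughly the ~1 minute startup stall per pod
```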
Are you fast registering your workflow?
I am also facing a similar situation
am using `pyflyte run`, so I believe so?
@justin hallquist I am registering using pyflyte register and then executing from the console
ya, cli from our side to be clear. same issue on both k3s linux and docker k8s mac, so it doesn't seem to be dependent on the hardware either
just to rule it out -- bumped the resources on a test task from 1500 MB mem to 5500 MB; still a ~1 minute delay
Can you try registering your workflow instead of fast registering it?
`pyflyte run` and `pyflyte register` resort to fast registration. https://docs.flyte.org/projects/cookbook/en/latest/getting_started/package_register.html#productionizing-your-workflows IMO, this should help reduce the time because the code will be present in the docker image and won't be pulled from s3.
@justin hallquist would you be able to install a very verbose version of flytekit? like off a branch. really not sure what could be happening here
because this is a local k8s dev env for the team, we want to avoid having to package, build a new image, push, etc., so as not to slow iterations down, and would rather focus on finding the root of the issue (pyflyte run is just super convenient). we managed to get specific jobs to break that 1 min delay wall. what we saw happening:
• the helm chart has a default max limit on task mem of 1Gi
• annotating the task with just the mem did not cause any errors; it just did not apply the larger amount (when i wrote earlier that i bumped to 5.5g and saw the same delay, i investigated further and noticed the issue) -- pods started up with the helm limits applied
• after addressing that by increasing the limit, our first task, which runs without parallelization, finished in ~12s (rather than 1-2 min)
• however, when running a lot of parallel tasks at the same time, the minute delay came back (we did bump the project resource max up to ensure that wouldn't bottleneck)
at the moment it's still unsolved, but i haven't had too much time the last day or so to investigate further
got it okay. yeah i get that. i was resorting to adding print logic to try to narrow down where exactly the delay is coming from, but if it’s likely memory constrained that sounds good. we can look into those issues - you’re using the helm chart?
and you’re saying that if you do that, it does not get applied?
one sec ill get ya the lines referenced
```yaml
# -- Task default resources parameters
task_resources:
  defaults:
    cpu: 100m
    memory: 200Mi
    storage: 5Mi
  limits:
    cpu: 1
    memory: 1Gi
    storage: 20Mi
    gpu: 1
```
was set as above:
specified via
was not applied to the pod. the definition in the console showed the correct value, but the pod itself had the helm limit. that makes sense because those are task limits, but since there was no error stating we were going above the limit, it became a fool's errand
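What seems to be happening here, sketched as a toy model (illustrative Python, not Flyte's actual code): the platform-level task limit from the helm chart silently caps whatever memory a task requests above it.

```python
# Toy sketch, not Flyte source: the platform task limit quietly caps
# any task memory request that exceeds it.
PLATFORM_LIMIT_MI = 1024  # helm chart's task limit: memory: 1Gi

def effective_memory(requested_mi, platform_limit_mi=PLATFORM_LIMIT_MI):
    # No error is raised when the request exceeds the limit;
    # the pod just starts with the smaller value.
    return min(requested_mi, platform_limit_mi)

print(effective_memory(5500))  # 1024 -- the 5500 MB request is silently capped
print(effective_memory(200))   # 200  -- requests under the limit pass through
```

This matches the symptom above: the console shows the requested value, the pod gets the helm limit, and nothing errors out.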
though resources seem to have made a difference for the first couple of tasks once unlocked after the helm change, i'm still confused, as the bottleneck doesn't seem to be the task's execution: the time to run two tasks in parallel is the same as running many times more than that. in the pic attached, all of those tasks had the 1 minute delay, but they also all completed reasonably quickly after the actual task started. if it were a resource constraint, i would expect only a few tasks to run at a time, or the small set of truth swaps to run drastically faster than the training (i.e. not get that delay; 4 of those tasks run in seconds locally). however, they were all effectively the same duration
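To make that reasoning concrete: if memory really were the bottleneck, parallel tasks would run in waves, and total wall time would scale with how many tasks fit on the cluster at once. A toy model with hypothetical numbers (none of these figures come from the thread):

```python
import math

# Toy model, not Flyte code: tasks run in "waves" when only a few
# fit in cluster memory at a time.
def wall_time(n_tasks, task_s, mem_per_task_gi, cluster_mem_gi):
    # How many tasks fit simultaneously, given their memory footprint.
    concurrent = max(1, cluster_mem_gi // mem_per_task_gi)
    waves = math.ceil(n_tasks / concurrent)
    return waves * task_s

# Hypothetical: 8 tasks of 12s each, 5.5Gi apiece, on 16Gi of cluster memory
# -> only 2 fit at once, so 4 waves: staggered completions, not uniform ones.
print(wall_time(8, 12, 5.5, 16))  # 48
print(wall_time(2, 12, 5.5, 16))  # 12
```

The observed behavior (every task taking effectively the same duration, delay included) is the opposite of this staggering, which supports the conclusion that the 1 minute stall isn't a memory constraint.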