# ask-the-community
t
Hey there! We are noting some relatively high latency for simple task execution. We're using the flyte-binary deployment in GKE. Tasks start immediately and the pods show complete, but take a couple of minutes before formally completing in Flyte (these are simple hello world tasks). Are there any knobs to turn to lower this execution latency? Some sort of heartbeat interval we can tighten? We've tried lowering both workflow-reeval-duration and downstream-eval-duration, but regardless, a task, even running on its own, takes ~2 minutes. @Dan Rammer (hamersaw)
d
Hey @Taylor Stout - thanks for the tag! So first off, perf is something we're spending considerable time ironing out here. I wrote an RFC on improving observability and have made considerable progress on it. I don't think it will be ready for 1.4 (end of February), but it will certainly be in 1.5.
Currently, our docs have a brief optimizing performance section. That may offer some solutions.
Unfortunately, perf tuning requires some level of esoteric knowledge of how propeller works - this will all be updated with the performance observability work, with better information about how to identify and mitigate overhead.
Let's dive into this, though - can you provide some timestamps? E.g. pod start / end times, container start / end times, flyte task start / end times?
We expect some level of overhead for task executions, but this should be measurable in seconds - if it's much larger, there can be issues with image pulls, k8s informers, or flyte queueing / reevaluations as you mentioned.
t
Hi Dan. I'm doing some more thorough digging into individual task executions. It appears the latency is coming from the pod itself. We're using the ghcr.io/flyteorg/flytekit:py3.10-1.3.1 image and it is present on the machine. Event logs in GKE show the container pulled, scheduled, and started in about 4 seconds. Start time for the container was 10:30:51 AM, and I added a stdout print test msg for the task starting execution, which comes through at 10:32:15 AM. Total runtime for the task was 1m 41s. So it appears most of the time is tied to some sort of "cold start" of the container image itself. There are no logs coming from the container that might clue me in to what it is doing.
d
So it sounds like the Flyte orchestration overhead is relatively small, ~17s, which is still larger than we would expect but not the largest issue here. Is this tested with fast registration? @Yee @Eduardo Apolinario (eapolinario) any ideas what would take 1m30s for flytekit to start running the task?
t
I believe I'm using "fast registration" - just calling pyflyte run --remote
d
Yeah, then you're using the fast registration process. To try and be brief here, this is designed to quickly deploy and iterate over workflow executions; however, there is additional overhead. When you call
pyflyte run --remote
Flyte packages the python code and writes it to the blobstore; then, when a task is executed, it needs to download the code from the blobstore and decompress it before it can execute it. The alternative approach is to build a specific image that already contains the code. It is explained in more depth here. This way, executing a task just starts a container and runs the python function (without needing to download the code from the blobstore first). This is the preferred approach for productionized workflows because of the performance. I don't think it will have a significant impact on the task execution time, but it may be worth trying. If the blobstore access is very slow, fast registration will result in more overhead.
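For illustration, a minimal sketch of the baked-image approach described above - the image name is a placeholder, and in practice it would be built from a Dockerfile that copies your project code into the image:
```
from flytekit import task, workflow

# Placeholder image that already contains this module's code baked in.
CUSTOM_IMAGE = "ghcr.io/my-org/my-flyte-project:v1"

@task(container_image=CUSTOM_IMAGE)
def hello(name: str) -> str:
    # With a baked image, the container starts and runs this function directly,
    # rather than first downloading and decompressing code from the blobstore
    # as fast registration does.
    return f"Hello, {name}!"

@workflow
def wf(name: str = "world") -> str:
    return hello(name=name)
```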
y
could you run with more logging?
FLYTE_SDK_LOGGING_LEVEL=10
set that as an environment variable in the @task decorator itself.
i think it’s
@task(environment={"FLYTE_SDK_LOGGING_LEVEL": "10"})
that will make it a lot more verbose and should clue us in as to where the time is being spent.
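As a runnable sketch of that suggestion (the task body is just a stand-in):
```
from flytekit import task

# Set flytekit's SDK logger to DEBUG (level 10) for this task only, so the
# container logs show where startup time is being spent.
@task(environment={"FLYTE_SDK_LOGGING_LEVEL": "10"})
def hello_world() -> str:
    return "hello"
```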
t
Yes, now getting much more verbose logs. Appears to be many passes at "Setting protocol to file". Many, many registering events. Here's a sampling:
Automatically registering file:// as file with <flytekit.types.structured.basic_dfs.PandasToParquetEncodingHandler object at 0x7f6a5a8593f0>
Registered <flytekit.types.structured.basic_dfs.PandasToParquetEncodingHandler object at 0x7f6a5a8593f0> as handler for <class 'pandas.core.frame.DataFrame'>, protocol gs, fmt parquet
Automatically registering file:// as file with <flytekit.types.structured.basic_dfs.ParquetToPandasDecodingHandler object at 0x7f6a5a7eb670>
y
are there timestamps associated with those log lines?
the default logger should print things in json
t
yes. They start at 2023-02-16 19:47:00 and the last register event log entry is at 2023-02-16 19:47:52 (52 seconds of streaming logs like the ones above). Then "Welcome to Flyte!" at 14:48:14. Then the task completed within 3 seconds.
A snippet:
y
thanks for the screenshot @Taylor Stout.
i haven’t looked into the cloudwatch log uploader in a while - qq for you - how accurate do you think those timestamps are?
the concerning bit is that it seems to have taken 30s from when python was loaded to when the download even started.
also, backing up a bit. you say this is on GKE? so you’ve deployed minio? (the paths are all s3:// paths, not gs:// paths)
t
Hi @Yee. The timestamps should be accurate. They align with the timestamps provided by the JSON payload:
Yes, this is on GKE. We have Minio deployed in that environment and it is configured with the minio endpoint.
@Yee @Dan Rammer (hamersaw) I figured this out. It was due to low CPU limits set on the tasks. When configuring our cluster, we set a low default request of 0.1 CPU allocation for tasks. We tend to set requests and not limits on CPU usage so that our services can utilize the hardware in an eager manner. It appears Flyte defaults all k8s resource "limits" for both CPU and memory to the "requested" values. I was able to set the resource allocation explicitly via the task decorator: requests=Resources(cpu="0.1"), limits=Resources(cpu="2")
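For reference, a minimal sketch of that fix in a task definition (the task body is just a stand-in):
```
from flytekit import task, Resources

# Request a small amount of CPU but allow bursting up to 2 cores, so flytekit
# startup isn't CPU-throttled by a limit that defaults to the low request.
@task(requests=Resources(cpu="0.1"), limits=Resources(cpu="2"))
def hello_world() -> str:
    return "hello"
```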
It was just being throttled. Thanks to you both for your time.