Hi, In a model pipeline, the tasks are taking lon...
# ask-the-community
Hi, In a model pipeline, the tasks are taking longer time to run than usual. A func. which would take ~2 mins to run is taking 35-40 min when using flyte. When I checked the cloudwatch logs, I see that there is a dormant period in every task which is taking up >98% of run time. In the image attached below, after 18 min (@ 17:27), the actual process is running and it takes seconds to complete. In between the initialization and the actual code run, when I checked the logs, these are available:
Are these contributing to the increased runtime ?
@KS Tarun this is very unusual. Flyte does have some unavoidable overhead to ensure reproducability, scalability, etc but these are typically minimal and account for a relatively small portion of the overall runtime. Do you have access to the k8s Pod information that Flyte uses to execute the task? I'm interested in the Pods start / end times and then the container start / end times as well.
Also, how big is the data sent as inputs to the task and what is the type?
Does the task produce a dataframe? if so, it might be the flyte deck issue.
@Dan Rammer (hamersaw) The shape of the dataframe is (25000,12000). I don't have the exact timestamps of the pods. But as soon as the task starts running, I checked using the
kubectl get pods
and it shows the respective pod is running and on describing the pod
kubectl describe pod
it is showing this:
@Kevin Su Yes,The output of a task is returned as dataframe & is given as input to the next task. Can you elaborate on the flyte deck issue ?
The problem is that flytekit will create a very big HTML file that renders a dataframe table. you can disable it by
, or you can render top 10 rows only
Copy code
def t1() -> typing.Annotated[pd.DataFrame, TopFrameRenderer(10)]:
we have disabled flyte deck by default since flytekit v1.2.3.
@Kevin Su / @Yee / @Eduardo Apolinario (eapolinario) we have been diving into this a bit and just had a few flytekit questions. Basic context: There is a task which has about 1G - 2G data on the low-end for input / output values. Flytepropeller is executing the task and the container is immediately created to start running user code. This runs for 10+ minutes (which is longer than expected - locally around 2-3min). Once the container completes the cloudwatch logs, which are output from the task, are all dumped with the same timestamp. A few questions here: (1) are the cloudwatch logs captured in real-time or once the task completes they are all gathered and written? (2) how can we better profile the execution duration from a flytekit-side? users tried to manually write the dataframe inputs / outputs to S3 and then pass file paths between tasks with some performance improvements, but still not as fast as hoped.
1. so cloudwatch logs are just a capture of the output from the container. there’s some agent on the nodes that will take container logs and push them up to cloudwatch. these agents work in real time afaik, with some delay. so it should be as the logging happens. everything with the same time stamp is a bit weird. can you try again @KS Tarun with kubectl logs? if k logs is showing incremental text, but cloudwatch doesn’t then it’s something with the agent. if it’s also nothing until done then this is a deeper issue that we should look at. how many log statements are there? maybe flushing of the buffer needs to happen more frequently.
2. i think we should talk about this @Dan Rammer (hamersaw) - maybe there’s something we can tie into the stuff you’re doing on the propeller side. in general i just increase the verbosity of the logging by setting
and looking at logging levels. but if you’re saying they’re all the same timestamp then that’s not really helpful. be nice to have a timeline in flyte deck though right?
also, re 1. how are the logs being generated? flytekit's loggers are not buffered, but I've seen
statements get buffered.
@Yee The flushing of the buffer is currently setup to happen once in 5 minutes. @Eduardo Apolinario (eapolinario) Yes, print statements are being used in this case. And they are reflected in the cloudwatch log of the respective pod with the same timestamp.
what version of flytekit is this?
and is it every task in the workflow?
or only some of them?