What is the expected minimum run_time for a flyte pod?
# flyte-support
c
@tall-lock-23197 What is the expected minimum run_time for a Flyte pod? My task: print the current epoch time and sleep for 25 seconds.
• Time taken on local sandbox = 40 seconds
• Time taken on remote cluster = 1 min
The pod is taking too much extra time for no apparent reason. I am using the default Docker image: cr.flyte.org/flyteorg/flytekit:py3.11-1.10.7
f
This depends on a lot of parameters
If the 1 min you see is the runtime of the Python code
this could be because of data movement, the Python env, etc.
This is why we are working on a Rust runtime to make things go faster
Also, data movement from S3 etc. can be slow
You can enable Decks and the timeline deck to see where it is spending time
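(For reference, a minimal sketch - not from the thread - of what enabling Decks looks like so the timeline deck renders; depending on the flytekit version the flag is enable_deck=True or the older disable_deck=False.)

```python
import time
from flytekit import task, workflow

# Decks (including the built-in timeline deck) are rendered for this task;
# on older flytekit releases use disable_deck=False instead of enable_deck=True.
@task(enable_deck=True)
def sleepy(a: int) -> int:
    time.sleep(a)
    return a

@workflow
def deck_wf() -> int:
    return sleepy(a=25)
```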
If you want fast execution, in Union we have a way to make workflows and tasks run in milliseconds
Happy to demo
g
Could you share the code snippet to repro?
c
The code is just print and sleep statements.
Copy code
import time
from datetime import datetime
import pytz
from flytekit import Resources, task, workflow


def dt():
    utc_now = datetime.now(pytz.utc)
    print(str(utc_now))
    return time.time()

@task(interruptible=True, retries=2)
def timed_task(a: int) -> dict:
    a1 = dt()
    print("Hello, I should  take time ", a, a1)
    time.sleep(a)
    a2 = dt()
    print("I am done", a2)
    return {"a":a, "diff": a2-a1}

@workflow
def time_taken_wf() -> None:
    timed_task(a=20) >> timed_task(a=20) >> timed_task(a=20)
    timed_task(a=25)
    timed_task(a=40)
    timed_task(a=100)
I am using the default Docker image.
YAML file for the pod:
Copy code
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    primary_container_name: fc200e7ff601b48c2840-n1-0
  creationTimestamp: "2024-05-26T08:22:34Z"
  labels:
    domain: development
    execution-id: fc200e7ff601b48c2840
    interruptible: "true"
    my-label: my-project
    node-id: n1
    project: my-project
    shard-key: "16"
    task-name: app-timed-task
    workflow-name: app-time-taken-wf
  name: fc200e7ff601b48c2840-n1-0
  namespace: my-project-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: fc200e7ff601b48c2840
    uid: 16a7a111-fbbd-4f52-afb5-d3878733616e
  resourceVersion: "12158"
  uid: 567d4ee4-7dba-4140-82b1-7ec4684d1207
spec:
  affinity: {}
  containers:
  - args:
    - pyflyte-fast-execute
    - --additional-distribution
    - s3://my-s3-bucket/my-project/development/EBR637LASIENUKDJPWFUXA47K4======/script_mode.tar.gz
    - --dest-dir
    - .
    - --
    - pyflyte-execute
    - --inputs
    - s3://my-s3-bucket/metadata/propeller/my-project-development-fc200e7ff601b48c2840/n1/data/inputs.pb
    - --output-prefix
    - s3://my-s3-bucket/metadata/propeller/my-project-development-fc200e7ff601b48c2840/n1/data/0
    - --raw-output-data-prefix
    - s3://my-s3-bucket/data/sd/fc200e7ff601b48c2840-n1-0
    - --checkpoint-path
    - s3://my-s3-bucket/data/sd/fc200e7ff601b48c2840-n1-0/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - app
    - task-name
    - timed_task
    env:
    - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
      value: my-project:development:app.time_taken_wf
    - name: FLYTE_INTERNAL_EXECUTION_ID
      value: fc200e7ff601b48c2840
    - name: FLYTE_INTERNAL_EXECUTION_PROJECT
      value: my-project
    - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
      value: development
    - name: FLYTE_ATTEMPT_NUMBER
      value: "0"
    - name: FLYTE_INTERNAL_TASK_PROJECT
      value: my-project
    - name: FLYTE_INTERNAL_TASK_DOMAIN
      value: development
    - name: FLYTE_INTERNAL_TASK_NAME
      value: app.timed_task
    - name: FLYTE_INTERNAL_TASK_VERSION
      value: Q2f7IDWJG7-hRFbRNnsC6A
    - name: FLYTE_INTERNAL_PROJECT
      value: my-project
    - name: FLYTE_INTERNAL_DOMAIN
      value: development
    - name: FLYTE_INTERNAL_NAME
      value: app.timed_task
    - name: FLYTE_INTERNAL_VERSION
      value: Q2f7IDWJG7-hRFbRNnsC6A
    - name: FLYTE_AWS_SECRET_ACCESS_KEY
      value: miniostorage
    - name: FLYTE_AWS_ENDPOINT
      value: http://flyte-sandbox-minio.flyte:9000
    - name: FLYTE_AWS_ACCESS_KEY_ID
      value: minio
    image: cr.flyte.org/flyteorg/flytekit:py3.11-1.10.7
    imagePullPolicy: IfNotPresent
    name: fc200e7ff601b48c2840-n1-0
    resources:
      limits:
        cpu: 500m
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-wj5zn
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 8d60e8d92d1d
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-wj5zn
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-05-26T08:22:41Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-05-26T08:23:11Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-05-26T08:23:11Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-05-26T08:22:41Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://8d8057c8073021afe196cb95cd7ea80fd5b0e7bc80c1bf02f83f4f70b31e0ea6
    image: cr.flyte.org/flyteorg/flytekit:py3.11-1.10.7
    imageID: cr.flyte.org/flyteorg/flytekit@sha256:3f3d95c825a1ed4caf328422e77846853afc2da8c9201f312b57a92e085ed643
    lastState: {}
    name: fc200e7ff601b48c2840-n1-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://8d8057c8073021afe196cb95cd7ea80fd5b0e7bc80c1bf02f83f4f70b31e0ea6
        exitCode: 0
        finishedAt: "2024-05-26T08:23:10Z"
        reason: Completed
        startedAt: "2024-05-26T08:22:42Z"
  hostIP: 172.17.0.2
  phase: Succeeded
  podIP: 10.42.0.174
  podIPs:
  - ip: 10.42.0.174
  qosClass: Guaranteed
  startTime: "2024-05-26T08:22:41Z"
f
There are ways of getting lower overhead - you have to read through the optimization guide to improve this. It's a couple of knobs. Important question: is this representative of your workload?
g
@freezing-airport-6809 which optimization guide are you referring to? Could you please share a link?
If it is this guide, https://docs.flyte.org/en/latest/deployment/configuration/performance.html#deployment-configuration-performance, we did not find any parameters there that could reduce pod startup time, based on this simple sleep workflow example.
c
The container/pod has started, but it takes ~10 seconds for the application code to run. And yes, it is representative of our workload.
g
What we are looking for is reducing those ~15 seconds of startup time
Copy code
{"asctime": "2024-05-27 12:29:28,492", "name": "flytekit", "levelname": "INFO", "message": "Registering an agent for task type: sensor, name: Sensor"}
{"asctime": "2024-05-27 12:29:33,285", "name": "flytekit", "levelname": "INFO", "message": "Getting <s3://s3-flyte-dev/flyte-pai/development/QXUM4TQWNYYGP6QNEF2O3OM43Q======/script_mode.tar.gz> to ./"}
{"asctime": "2024-05-27 12:29:35,200", "name": "flytekit", "levelname": "INFO", "message": "Download data to local from <s3://s3-flyte-dev/flyte-pai/development/QXUM4TQWNYYGP6QNEF2O3OM43Q======/script_mode.tar.gz>. [Wall Time: 3.4110170670001025s, Process Time: 0.33784325000000004s]"}
{"asctime": "2024-05-27 12:29:35,281", "name": "flytekit", "levelname": "INFO", "message": "Download distribution. [Wall Time: 3.4918792429998575s, Process Time: 0.33875893199999996s]"}
{"asctime": "2024-05-27 12:29:40,083", "name": "flytekit", "levelname": "INFO", "message": "Registering an agent for task type: sensor, name: Sensor"}
{"asctime": "2024-05-27 12:29:43,090", "name": "flytekit", "levelname": "INFO", "message": "Welcome to Flyte! Version: 1.10.7"}
{"asctime": "2024-05-27 12:29:43,091", "name": "flytekit", "levelname": "DEBUG", "message": "Running task execution with resolver flytekit.core.python_auto_container.default_task_resolver..."}

{"asctime": "2024-05-27 12:29:43,091", "name": "flytekit", "levelname": "INFO", "message": "Using user directory /tmp/flyte-n3y5a2rf/sandbox/local_flytekit/eaf508b247b34c7e40f2df0e458d669d"}
{"asctime": "2024-05-27 12:29:43,091", "name": "flytekit", "levelname": "DEBUG", "message": "Checkpointer created with source None and dest <s3://s3-flyte-dev/data/95/fda3822f0c6c04ff78be-n0-0/_flytecheckpoints>"}
f
Runtime performance is not the Flyte backend, it is flytekit
g
Ok, I understand that runtime performance is more related to flytekit. Could you tell us if there is anything we can do to optimize the task cold-start time, which now takes about 15s across different flytekit operations?
f
@glamorous-rainbow-77959 / @clever-exabyte-82294 I tried this just now; I am seeing a time of 1+ min for some nodes, where it had to launch a new machine. This is on a Union cluster. Here is the graph
Also, Flyte did its best to coordinate and start the runtime
g
So, it’s not reproducing on your end?
f
I am diving deeper
to find areas of speedup
there is a timeline deck
I think fast registration can add overhead at run time, as it has to download your code
this is why we recommend using a fully containerized env for runtime
g
I think fast registration can add overhead at run time, as it has to download your code
Yes, that’s one thing I was thinking about
f
this can happen
but the speed of iteration is key for development
here is another example, where we optimize container startup times (this is union only - core infra). It goes even faster
👍 1
g
this is why we recommend using a fully containerized env for runtime
Meaning baking our code into the Docker image? Does Flyte support this?
f
For runtime I mean in production
Yes, Flyte supports this
just use pyflyte register --non-fast
👀 1
@glamorous-rainbow-77959 I notice -
@task(interruptible=True, retries=2)
that is forcing it to move to spot machines too
this can be slow right - to get a machine
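(A hypothetical variant of the task above, assuming the scheduling delay matters: dropping interruptible keeps the pod off spot/preemptible nodes, avoiding the extra wait for a machine.)

```python
import time
from flytekit import task

# Not interruptible: the pod is scheduled on regular (non-spot) nodes.
@task(interruptible=False, retries=2)
def timed_task_on_demand(a: int) -> float:
    start = time.time()
    time.sleep(a)
    return time.time() - start
```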
g
this can be slow right - to get a machine
Yes, @freezing-airport-6809, but we are measuring the startup time only after the pod was started, by directly inspecting the pod logs, so that startup time is excluded.
https://flyte-org.slack.com/archives/CP2HDHKE1/p1716815888555819?thread_ts=1716635869.749209&cid=CP2HDHKE1 Here are the logs where you can see 15s of work done by flytekit before running the actual task code
f
So this is what I see
I think fast registration might be the reason, as Python startup can be slow
This is why we are working on the rust core
g
I think fast registration might be the reason, as Python startup can be slow
Hmm. So if we use non-fast registration this should go away? How will it affect Python startup?
f
this is because we restart the Python VM a couple of times - we have a change that might speed it up dramatically. Just want to make sure perf is good
I think so
it will have only 1 startup, but it's still Python
But happy to work with you to improve this
this is one of our focus areas
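(A quick, hypothetical check of how much of that per-start cost is plain Python import time for flytekit itself; run it inside the task image.)

```python
import time

start = time.perf_counter()
import flytekit  # noqa: F401  heavy import: plugins, agents, type engine
print(f"flytekit import took {time.perf_counter() - start:.2f}s")
```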
Also, if you really care about perf - we do have a reusable container system in Union that makes it possible to run things in milliseconds
@glamorous-rainbow-77959 sorry, I need to go to bed now 🙂 please use the timeline deck to find any problems. We will work on this in the next couple of months
🙇🏽 1
👍 1
c
An update.. thanks, Ketan. It reduced our runtime on remote from 1 min to 40 seconds. The workload is still sleep(25), so the gap is 15 seconds (this might be acceptable depending on the team's feedback). The major gap I see is here, around 5 seconds:
Copy code
{"asctime": "2024-05-29 05:36:45,420", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/flytekit/.flyte/config.yaml"}
{"asctime": "2024-05-29 05:36:47,414", "name": "flytekit", "levelname": "INFO", "message": "Registering an agent for task type: sensor, name: Sensor"}
{"asctime": "2024-05-29 05:36:51,011", "name": "flytekit", "levelname": "INFO", "message": "Welcome to Flyte! Version: 1.10.7"}
f
This is interesting
How did the agent get registered? You might have found a perf regression here. Cc @glamorous-carpet-83516 - we see agent registration and file lookup taking 7 seconds
@clever-exabyte-82294 we will take a look. It does not happen all the time
g
just a comment to exclude the network: our EKS cluster and S3 bucket both are in ap-south-1
f
Ya, registering an agent involves no network, and it should not be happening at runtime
Where did you get these logs from?
g
those are just pod logs
f
Then this is a bug
c
Another 3 seconds are here. I set the environment variable FLYTE_SDK_LOGGING_LEVEL to get debug logs:
Copy code
{"asctime": "2024-05-29 05:36:51,015", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.secret-key could not be found in yaml config"}
{"asctime": "2024-05-29 05:36:54,216", "name": "flytekit", "levelname": "DEBUG", "message": "\t\t[2] Pushing context - execute, branch[False], StackOrigin(setup_execution, 299, /usr/local/lib/python3.11/site-packages/flytekit/bin/entrypoint.py)"}
{"asctime": "2024-05-29 05:36:54,312", "name": "flytekit", "levelname": "DEBUG", "message": "Task returns unnamed native tuple <class 'dict'>"}
f
Cc @high-accountant-32689
g
I removed these dependencies from the default image, and it reduced the overhead by 6-8 sec. Still investigating.
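(A hypothetical helper for this kind of investigation: attribute startup cost to individual imports inside the image. The plugin module names below are examples and may differ per image.)

```python
import importlib
import time

# Time each import separately; slow plugin imports show up here.
for mod in ("flytekit", "flytekitplugins.pod", "flytekitplugins.deck"):
    t0 = time.perf_counter()
    try:
        importlib.import_module(mod)
        print(f"{mod}: {time.perf_counter() - t0:.2f}s")
    except ImportError:
        print(f"{mod}: not installed")
```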
f
@clever-exabyte-82294 / @glamorous-rainbow-77959
I actually do not think we need the pod plugin or even deck-standard. scikit-learn was there for an example, which is not the getting-started example anymore, right?
c
I can try using this updated Dockerfile as my base and measure the results. We don't need decks; we also don't need to completely customize the pod.
f
You can customize the pod
That plugin is useless now - just legacy, we should delete it. cc @high-accountant-32689
c
It has led to an increase on the remote cluster from 20 to 35 seconds:
• an unknown increase of 3 sec from the time the container is ready to the first DEBUG logs
• an increase of 14 seconds around registering handlers for different classes (near the "registering handlers" logs), depending on how the logs are interpreted
f
Can you run a few examples?
It seems the real loss is those 14 seconds
So your talk gas task has a lot of imports
c
talk gas a lot of imports?
didn't understand this.
f
oops, typo
g
@freezing-airport-6809 @high-accountant-32689 hi. Could you tell us if there's any progress on this issue?
f
I am unsure of the issue. I thought at the end we identified it was your code / imports?
c
I believe 2 things were discussed:
• Registering an agent for task type: sensor, name: Sensor taking some time
  ◦ This issue was recognized as a perf regression, but no solution was provided
• Removing dependencies from the Dockerfile
  ◦ I have tried this multiple times, on the local system as well as on the remote cluster
  ◦ The outcome was an increase in time instead of a decrease, as I suggested before. I haven't been able to do more testing around it, hence didn't respond, but we can pick it up now. I tested on local and remote machines; the result was similar across multiple runs and different workflows.
f
Cc @glamorous-carpet-83516, who did some investigating and removed the sensor calls, but if anything there should be an issue to track. Slack threads are not great.
c
Should I create a GitHub issue, or is there some other tracking mechanism? Is there a plan to release the fix with the Flyte release cycle? One more team in our organisation is currently evaluating Flyte. With the startup time adding up for each task, any complex workflow with 6 steps will take 2 minutes just for starting up (6 * 20 seconds).
👍 1