What is the expected minimum run_time for a flyte pod?
# flyte-support
c
@tall-lock-23197 What is the expected minimum run_time for a Flyte pod? My task: print the current epoch time and sleep for 25 seconds.
• Time taken on local sandbox = 40 seconds
• Time taken on remote cluster = 1 min
The pod is taking too much extra time for no apparent reason. I am using the default Docker image: cr.flyte.org/flyteorg/flytekit:py3.11-1.10.7
f
This depends on a lot of parameters
If the 1 min you see is the runtime of the Python code
this could be because of data movement, the Python env, etc.
This is why we are working on a Rust runtime to make things go faster
Also, data movement from S3 etc. can be slow
You can enable Decks and the timeline deck to see where it is spending time
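(For reference, a minimal sketch - not from the thread - of what enabling Decks looks like so the timeline deck renders; depending on the flytekit version the flag is enable_deck=True or the older disable_deck=False.)

```python
import time
from flytekit import task, workflow

# Decks (including the built-in timeline deck) are rendered for this task;
# on older flytekit releases use disable_deck=False instead of enable_deck=True.
@task(enable_deck=True)
def sleepy(a: int) -> int:
    time.sleep(a)
    return a

@workflow
def deck_wf() -> int:
    return sleepy(a=25)
```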
If you want fast execution, in Union we have a way to make workflows and tasks run in milliseconds
Happy to demo
g
Could you share the code snippet to repro?
c
The code is just print and sleep statements.
Copy code
import time
from datetime import datetime
import pytz
from flytekit import Resources, task, workflow


def dt():
    utc_now = datetime.now(pytz.utc)
    print(str(utc_now))
    return time.time()

@task(interruptible=True, retries=2)
def timed_task(a: int) -> dict:
    a1 = dt()
    print("Hello, I should  take time ", a, a1)
    time.sleep(a)
    a2 = dt()
    print("I am done", a2)
    return {"a":a, "diff": a2-a1}

@workflow
def time_taken_wf() -> None:
    timed_task(a=20) >> timed_task(a=20) >> timed_task(a=20)
    timed_task(a=25)
    timed_task(a=40)
    timed_task(a=100)
I am using the default Docker image.
YAML file for the pod:
Copy code
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    primary_container_name: fc200e7ff601b48c2840-n1-0
  creationTimestamp: "2024-05-26T08:22:34Z"
  labels:
    domain: development
    execution-id: fc200e7ff601b48c2840
    interruptible: "true"
    my-label: my-project
    node-id: n1
    project: my-project
    shard-key: "16"
    task-name: app-timed-task
    workflow-name: app-time-taken-wf
  name: fc200e7ff601b48c2840-n1-0
  namespace: my-project-development
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: fc200e7ff601b48c2840
    uid: 16a7a111-fbbd-4f52-afb5-d3878733616e
  resourceVersion: "12158"
  uid: 567d4ee4-7dba-4140-82b1-7ec4684d1207
spec:
  affinity: {}
  containers:
  - args:
    - pyflyte-fast-execute
    - --additional-distribution
    - s3://my-s3-bucket/my-project/development/EBR637LASIENUKDJPWFUXA47K4======/script_mode.tar.gz
    - --dest-dir
    - .
    - --
    - pyflyte-execute
    - --inputs
    - s3://my-s3-bucket/metadata/propeller/my-project-development-fc200e7ff601b48c2840/n1/data/inputs.pb
    - --output-prefix
    - s3://my-s3-bucket/metadata/propeller/my-project-development-fc200e7ff601b48c2840/n1/data/0
    - --raw-output-data-prefix
    - s3://my-s3-bucket/data/sd/fc200e7ff601b48c2840-n1-0
    - --checkpoint-path
    - s3://my-s3-bucket/data/sd/fc200e7ff601b48c2840-n1-0/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - app
    - task-name
    - timed_task
    env:
    - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
      value: my-project:development:app.time_taken_wf
    - name: FLYTE_INTERNAL_EXECUTION_ID
      value: fc200e7ff601b48c2840
    - name: FLYTE_INTERNAL_EXECUTION_PROJECT
      value: my-project
    - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
      value: development
    - name: FLYTE_ATTEMPT_NUMBER
      value: "0"
    - name: FLYTE_INTERNAL_TASK_PROJECT
      value: my-project
    - name: FLYTE_INTERNAL_TASK_DOMAIN
      value: development
    - name: FLYTE_INTERNAL_TASK_NAME
      value: app.timed_task
    - name: FLYTE_INTERNAL_TASK_VERSION
      value: Q2f7IDWJG7-hRFbRNnsC6A
    - name: FLYTE_INTERNAL_PROJECT
      value: my-project
    - name: FLYTE_INTERNAL_DOMAIN
      value: development
    - name: FLYTE_INTERNAL_NAME
      value: app.timed_task
    - name: FLYTE_INTERNAL_VERSION
      value: Q2f7IDWJG7-hRFbRNnsC6A
    - name: FLYTE_AWS_SECRET_ACCESS_KEY
      value: miniostorage
    - name: FLYTE_AWS_ENDPOINT
      value: http://flyte-sandbox-minio.flyte:9000
    - name: FLYTE_AWS_ACCESS_KEY_ID
      value: minio
    image: cr.flyte.org/flyteorg/flytekit:py3.11-1.10.7
    imagePullPolicy: IfNotPresent
    name: fc200e7ff601b48c2840-n1-0
    resources:
      limits:
        cpu: 500m
        memory: 1Gi
      requests:
        cpu: 500m
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-wj5zn
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 8d60e8d92d1d
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-wj5zn
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-05-26T08:22:41Z"
    reason: PodCompleted
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-05-26T08:23:11Z"
    reason: PodCompleted
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-05-26T08:23:11Z"
    reason: PodCompleted
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-05-26T08:22:41Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://8d8057c8073021afe196cb95cd7ea80fd5b0e7bc80c1bf02f83f4f70b31e0ea6
    image: cr.flyte.org/flyteorg/flytekit:py3.11-1.10.7
    imageID: cr.flyte.org/flyteorg/flytekit@sha256:3f3d95c825a1ed4caf328422e77846853afc2da8c9201f312b57a92e085ed643
    lastState: {}
    name: fc200e7ff601b48c2840-n1-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://8d8057c8073021afe196cb95cd7ea80fd5b0e7bc80c1bf02f83f4f70b31e0ea6
        exitCode: 0
        finishedAt: "2024-05-26T08:23:10Z"
        reason: Completed
        startedAt: "2024-05-26T08:22:42Z"
  hostIP: 172.17.0.2
  phase: Succeeded
  podIP: 10.42.0.174
  podIPs:
  - ip: 10.42.0.174
  qosClass: Guaranteed
  startTime: "2024-05-26T08:22:41Z"
f
There are ways of getting lower overhead - you have to read through the optimization guide to improve this. It's a couple of knobs. Important question: is this representative of your workload?
g
@freezing-airport-6809 which optimization guide are you referring to? Could you please share a link?
If it is this guide, https://docs.flyte.org/en/latest/deployment/configuration/performance.html#deployment-configuration-performance, we did not find any parameters there that could reduce pod startup time, based on this simple sleep workflow example.
c
The container/pod has started, but it takes ~10 seconds for the application code to run. And yes, it is representative of our workload.
g
What we are looking for is reducing those ~15 seconds of startup time
Copy code
{"asctime": "2024-05-27 12:29:28,492", "name": "flytekit", "levelname": "INFO", "message": "Registering an agent for task type: sensor, name: Sensor"}
{"asctime": "2024-05-27 12:29:33,285", "name": "flytekit", "levelname": "INFO", "message": "Getting <s3://s3-flyte-dev/flyte-pai/development/QXUM4TQWNYYGP6QNEF2O3OM43Q======/script_mode.tar.gz> to ./"}
{"asctime": "2024-05-27 12:29:35,200", "name": "flytekit", "levelname": "INFO", "message": "Download data to local from <s3://s3-flyte-dev/flyte-pai/development/QXUM4TQWNYYGP6QNEF2O3OM43Q======/script_mode.tar.gz>. [Wall Time: 3.4110170670001025s, Process Time: 0.33784325000000004s]"}
{"asctime": "2024-05-27 12:29:35,281", "name": "flytekit", "levelname": "INFO", "message": "Download distribution. [Wall Time: 3.4918792429998575s, Process Time: 0.33875893199999996s]"}
{"asctime": "2024-05-27 12:29:40,083", "name": "flytekit", "levelname": "INFO", "message": "Registering an agent for task type: sensor, name: Sensor"}
{"asctime": "2024-05-27 12:29:43,090", "name": "flytekit", "levelname": "INFO", "message": "Welcome to Flyte! Version: 1.10.7"}
{"asctime": "2024-05-27 12:29:43,091", "name": "flytekit", "levelname": "DEBUG", "message": "Running task execution with resolver flytekit.core.python_auto_container.default_task_resolver..."}

{"asctime": "2024-05-27 12:29:43,091", "name": "flytekit", "levelname": "INFO", "message": "Using user directory /tmp/flyte-n3y5a2rf/sandbox/local_flytekit/eaf508b247b34c7e40f2df0e458d669d"}
{"asctime": "2024-05-27 12:29:43,091", "name": "flytekit", "levelname": "DEBUG", "message": "Checkpointer created with source None and dest <s3://s3-flyte-dev/data/95/fda3822f0c6c04ff78be-n0-0/_flytecheckpoints>"}
f
Runtime performance is not the Flyte backend, it is flytekit
g
Ok, I understand that runtime performance is more related to flytekit. Could you tell us if there is anything we can do to optimize the task cold-start time, which now takes about 15s across different flytekit operations?
f
@glamorous-rainbow-77959 / @clever-exabyte-82294 I tried this just now; I am seeing a time of 1+ min for some nodes, where it had to launch a new machine. This is on a Union cluster. Here is the graph
Also, Flyte did its best to coordinate and start the runtime
g
So, it’s not reproducing on your end?
f
I am diving deeper
to find areas of speedup
there is a timeline deck
I think fast registration can add overhead at run time, as it has to download your code
this is why we recommend using a fully containerized env for runtime
g
I think fast registration can add overhead at run time, as it has to download your code
Yes, that’s one thing I was thinking about
f
this can happen
but the speed of iteration is key for development
here is another example, where we optimize container startup times (this is union only - core infra). It goes even faster
👍 1
g
this is why we recommend using a fully containerized env for runtime
Meaning baking our code into the Docker image? Does Flyte support this?
f
For runtime I mean in production
Yes, Flyte supports this
just use pyflyte register --non-fast
👀 1
@glamorous-rainbow-77959 I notice -
@task(interruptible=True, retries=2)
that is forcing it to move to spot machines too
this can be slow right - to get a machine
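(A hypothetical variant of the task above, assuming the scheduling delay matters: dropping interruptible keeps the pod off spot/preemptible nodes, avoiding the extra wait for a machine.)

```python
import time
from flytekit import task

# Not interruptible: the pod is scheduled on regular (non-spot) nodes.
@task(interruptible=False, retries=2)
def timed_task_on_demand(a: int) -> float:
    start = time.time()
    time.sleep(a)
    return time.time() - start
```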
g
this can be slow right - to get a machine
Yes, @freezing-airport-6809, but we are measuring the startup time only after the pod was started, by directly inspecting the pod logs, so that startup time is excluded.
https://flyte-org.slack.com/archives/CP2HDHKE1/p1716815888555819?thread_ts=1716635869.749209&cid=CP2HDHKE1 Here are the logs where you can see 15s of work done by flytekit before running the actual task code
f
So this is what I see
I think fast registration might be the reason, as Python startup can be slow
This is why we are working on the rust core
g
I think fast registration might be the reason, as Python startup can be slow
Hmm. So if we use non-fast registration this should go away? How will it affect Python startup?
f
this is because we restart the Python VM a couple of times - we have a change that might speed it up dramatically. Just want to make sure perf is good
I think so
it will have only 1 startup, but it's still Python
But happy to work with you to improve this
this is one of our focus areas
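(A quick, hypothetical check of how much of that per-start cost is plain Python import time for flytekit itself; run it inside the task image.)

```python
import time

start = time.perf_counter()
import flytekit  # noqa: F401  heavy import: plugins, agents, type engine
print(f"flytekit import took {time.perf_counter() - start:.2f}s")
```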
Also, if you really care about perf - we do have a reusable container system in Union that makes it possible to run things in milliseconds
@glamorous-rainbow-77959 sorry, I need to go to bed now 🙂 please use the timeline deck to find any problems. We will work on this in the next couple of months
🙇🏽 1
👍 1
c
An update.. thanks, Ketan. It reduced our runtime on remote from 1 min to 40 seconds. The workload is still sleep(25), so the gap is 15 seconds (this might be acceptable depending on the team's feedback). The major gap I see is here, around 5 seconds:
Copy code
{"asctime": "2024-05-29 05:36:45,420", "name": "flytekit", "levelname": "INFO", "message": "Using flytectl/YAML config /home/flytekit/.flyte/config.yaml"}
{"asctime": "2024-05-29 05:36:47,414", "name": "flytekit", "levelname": "INFO", "message": "Registering an agent for task type: sensor, name: Sensor"}
{"asctime": "2024-05-29 05:36:51,011", "name": "flytekit", "levelname": "INFO", "message": "Welcome to Flyte! Version: 1.10.7"}
f
This is interesting
How did the agent get registered? You might have found a perf regression here. Cc @glamorous-carpet-83516 - we see agent registration and file lookup taking 7 seconds
@clever-exabyte-82294 we will take a look. It does not happen all the time
g
just a comment to exclude the network: our EKS cluster and S3 bucket both are in ap-south-1
f
Ya, registering an agent involves no network, and it should not be happening at runtime
Where did you get these logs from?
g
those are just pod logs
f
Then this is a bug
c
Another 3 seconds are here. I set the environment variable FLYTE_SDK_LOGGING_LEVEL to get debug logs:
Copy code
{"asctime": "2024-05-29 05:36:51,015", "name": "flytekit", "levelname": "DEBUG", "message": "Switch storage.connection.secret-key could not be found in yaml config"}
{"asctime": "2024-05-29 05:36:54,216", "name": "flytekit", "levelname": "DEBUG", "message": "\t\t[2] Pushing context - execute, branch[False], StackOrigin(setup_execution, 299, /usr/local/lib/python3.11/site-packages/flytekit/bin/entrypoint.py)"}
{"asctime": "2024-05-29 05:36:54,312", "name": "flytekit", "levelname": "DEBUG", "message": "Task returns unnamed native tuple <class 'dict'>"}
f
Cc @high-accountant-32689
g
I removed these dependencies from the default image, and it reduced the overhead by 6-8 sec. Still investigating.
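(A hypothetical helper for this kind of investigation: attribute startup cost to individual imports inside the image. The plugin module names below are examples and may differ per image.)

```python
import importlib
import time

# Time each import separately; slow plugin imports show up here.
for mod in ("flytekit", "flytekitplugins.pod", "flytekitplugins.deck"):
    t0 = time.perf_counter()
    try:
        importlib.import_module(mod)
        print(f"{mod}: {time.perf_counter() - t0:.2f}s")
    except ImportError:
        print(f"{mod}: not installed")
```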
f
@clever-exabyte-82294 / @glamorous-rainbow-77959
I actually do not think we need the pod plugin or even deck-standard. scikit-learn was there for an example, which is not the getting-started example anymore, right?
c
I can try using this updated Dockerfile as my base and measure the results. We don't need decks; we also don't need to completely customize the pod.
f
You can customize the pod
That plugin is useless now - just legacy, we should delete it. cc @high-accountant-32689
c
It has led to an increase on the remote cluster from 20 to 35 seconds:
• an unknown increase of 3 sec from the time the container is ready to the first DEBUG logs
• an increase of 14 seconds around registering handlers for different classes (near the "registering handlers" logs), depending on how the logs are interpreted
f
Can you run a few examples?
It seems the real loss is those 14 seconds
So your talk gas task has a lot of imports
c
talk gas a lot of imports?
didn't understand this.
f
oops, typo
g
@freezing-airport-6809 @high-accountant-32689 hi. Could you tell us if there's any progress on this issue?
f
I am unsure of the issue. I thought at the end we identified it was your code / imports?
c
I believe 2 things were discussed:
• Registering an agent for task type: sensor, name: Sensor taking some time
  ◦ This issue was recognized as a perf regression, but no solution was provided
• Removing dependencies from the Dockerfile
  ◦ I have tried this multiple times, on the local system as well as on the remote cluster
  ◦ The outcome was an increase in time instead of a decrease, as I suggested before. I haven't been able to do more testing around it, hence didn't respond, but we can pick it up now. I tested on local and remote machines; the result was similar across multiple runs and different workflows.
f
Cc @glamorous-carpet-83516, who did some investigating and removed the sensor calls, but if anything there should be an issue to track. Slack threads are not great.
c
Should I create a GitHub issue, or is there some other tracking mechanism? Is there a plan to release the fix with the Flyte release cycle? One more team in our organisation is currently evaluating Flyte. With the startup time adding up for each task, any complex workflow with 6 steps will take 2 minutes just for starting up (6 * 20 seconds).
👍 1