# ask-the-community
r
Hi team.. I need some help. I have a flyte task that needs to read a parquet file stored in s3 (about 2 GB in size) and return 1000 records from it… I’m using awswrangler to read the file into a pandas dataframe and then return the first 1000 records from it. Each task node has up to 8 GB of memory… but the pod keeps getting OOMKilled with no useful info in the stack trace. Not sure what’s going on…
k
is 8GB the task resource request or the limit?
could you check the pod spec using kubectl describe? just want to make sure flyte set the resources correctly
r
sorry.. not 8.. 18. Here’s a snippet of our helm chart:
Copy code
# -- Task default resources configuration
  task_resource_defaults:
    task_resources:
      defaults:
        cpu: 100m
        memory: 200Mi
        storage: 200Mi
        ephemeral-storage: 1Gi
        gpu: 0
      limits:
        cpu: 9
        memory: 18Gi
        storage: 20Gi
        ephemeral-storage: 28Gi
        gpu: 0
so yes.. that’s the limit
k
how about the pod spec? could you show me the output of kubectl get pods <name> -n flytesnack-development
r
Sent you the screenshot… Meanwhile, here are the sanitized logs from a pod that got OOMKilled:
Copy code
{"asctime": "2022-12-28 05:20:19,825", "name": "flytekit", "levelname": "INFO", "message": "Setting protocol to file"}
{"asctime": "2022-12-28 05:20:25,924", "name": "flytekit", "levelname": "INFO", "message": "We won't register PyTorchCheckpointTransformer, PyTorchTensorTransformer, and PyTorchModuleTransformer because torch is not installed."}
{"asctime": "2022-12-28 05:20:26,124", "name": "flytekit", "levelname": "INFO", "message": "Setting protocol to file"}
{"asctime": "2022-12-28 05:20:26,124", "name": "flytekit", "levelname": "INFO", "message": "Setting protocol to file"}
{"asctime": "2022-12-28 05:20:26,125", "name": "flytekit", "levelname": "INFO", "message": "Setting protocol to file"}
{"asctime": "2022-12-28 05:20:26,125", "name": "flytekit", "levelname": "INFO", "message": "Setting protocol to file"}

{"asctime": "2022-12-28 05:20:27,733", "name": "flytekit.entrypoint", "levelname": "INFO", "message": "Welcome to Flyte! Version: 1.2.1"}
{"asctime": "2022-12-28 05:20:27,734", "name": "flytekit.entrypoint", "levelname": "INFO", "message": "Using user directory /tmp/flyte-i8059ewt/sandbox/local_flytekit/9f57ef630f6df792206a67f0b4e593c5"}
{"asctime": "2022-12-28 05:20:32,830", "name": "flytekit", "levelname": "INFO", "message": "Entering timed context: Copying (<s3://my-flyte/flyte/metadata/propeller/project-development-agbd42n8ph94v8rwcbgv/n0/data/inputs.pb> -> /tmp/flyte-i8059ewt/sandbox/local_flytekit/inputs.pb)"}
{"asctime": "2022-12-28 05:20:41,531", "name": "flytekit", "levelname": "INFO", "message": "Output of command '['aws', 's3', 'cp', '<s3://my-flyte/flyte/metadata/propeller/project-development-agbd42n8ph94v8rwcbgv/n0/data/inputs.pb>', '/tmp/flyte-i8059ewt/sandbox/local_flytekit/inputs.pb']':\nb'Completed 130 Bytes/130 Bytes (697 Bytes/s) with 1 file(s) remaining\\rdownload: <s3://my-flyte/flyte/metadata/propeller/project-development-agbd42n8ph94v8rwcbgv/n0/data/inputs.pb> to ../tmp/flyte-i8059ewt/sandbox/local_flytekit/inputs.pb\\n'\n"}
{"asctime": "2022-12-28 05:20:41,532", "name": "flytekit", "levelname": "INFO", "message": "Exiting timed context: Copying (<s3://my-flyte/flyte/metadata/propeller/project-development-agbd42n8ph94v8rwcbgv/n0/data/inputs.pb> -> /tmp/flyte-i8059ewt/sandbox/local_flytekit/inputs.pb) [Wall Time: 8.701755479996791s, Process Time: 0.027927802s]"}
{"asctime": "2022-12-28 05:20:41,534", "name": "flytekit", "levelname": "INFO", "message": "Invoking flyte.workflows.my_workflow.read_input_records with inputs: {'num_requests': 1000, 'input_path': '<s3://my_bucket/some_file_name.parquet>'}"}
k
sorry, I said something wrong. kubectl describe …
r
I’m seeing this in the describe pod output
Copy code
Limits:
      cpu:     100m
      memory:  200Mi
    Requests:
      cpu:     100m
      memory:  200Mi
wondering why the 18 GB limit didn’t take effect?
k
could you check the propeller config map? just want to make sure the resource is 18GB in that configmap.
btw, you’re running a regular task, right? not a spark or tensorflow job
r
python task
kubectl describe pod flytepropeller-6984f5cd7-ws65t
Copy code
Limits:
      cpu:                200m
      ephemeral-storage:  100Mi
      memory:             200Mi
    Requests:
      cpu:                10m
      ephemeral-storage:  50Mi
      memory:             100Mi
is that what you wanted?
k
no, kubectl edit cm <propeller config map>
you can use kubectl get cm to get the name first
r
ah.. hang on
Copy code
k8s.yaml: |
    plugins:
      k8s:
        default-cpus: 100m
        default-env-vars: []
        default-memory: 100Mi
I can’t seem to find any mention of the 18 GB
k
sorry again, flyteadmin config map
r
Copy code
task_resource_defaults.yaml: |
    task_resources:
      defaults:
        cpu: 100m
        ephemeral-storage: 1Gi
        gpu: 0
        memory: 200Mi
        storage: 200Mi
      limits:
        cpu: 9
        ephemeral-storage: 28Gi
        gpu: 0
        memory: 18Gi
        storage: 20Gi
k
one sec
have you changed the resources in the task decorator?
r
no… didn’t realize I needed to do that
k
no, you don’t. just to confirm
oh. i see. flyteadmin will set both the request and the limit to the value of task_resources.defaults. To change the limit, you have to set it in the task decorator.
r
is it the task_config parameter? do you have an example of how to set limits in the decorator? There’s also requests in the decorator
r
thanks.. so do I set both?
k
I think you only need to set the limit. if that doesn’t work, try setting both
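something like this, for example (task name and resource values here are just illustrative, not from this thread):
Copy code
from flytekit import Resources, task


@task(
    requests=Resources(cpu="1", mem="2Gi"),  # what the pod asks the scheduler for
    limits=Resources(cpu="2", mem="10Gi"),   # hard cap; exceeding memory gets the pod OOMKilled
)
def my_heavy_task() -> None:
    ...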
cc @Ketan (kumare3) why can’t we use the limit in the task_resources by default?
r
No crash so far 🤞🏼
ok.. that worked
Thank you!
k
@Rupsha Chaudhuri can you help document what you had to do?
r
Sure.. but not sure why on earth that task took 4.5 hours??? All it does is read a 2 GB parquet file into a pandas dataframe and return the first 1000 records. From the logs there appears to be a lot of reading and writing to s3 in tiny chunks (600 bytes here, 7 KB there, and so on). Example:
Copy code
Exiting timed context: Writing (/tmp/flyte2qtcr_i0/local_flytekit/f4484561b67506b8c8bfe18c1b4a315d -> s3://my-flyte/2x/axd9n8qhbwlgszfb9slf-n0-0/939989a9a40c75f65c631bdf11070354/f4484561b67506b8c8bfe18c1b4a315d) [Wall Time: 6.298165863845497s, Process Time: 0.029386856000002126s]
Output of command '['aws', 's3', 'cp', '--acl', 'bucket-owner-full-control', '/tmp/flyte2qtcr_i0/local_flytekit/f4484561b67506b8c8bfe18c1b4a315d', 's3://my-flyte/2x/axd9n8qhbwlgszfb9slf-n0-0/939989a9a40c75f65c631bdf11070354/f4484561b67506b8c8bfe18c1b4a315d']':
b'Completed 8.9 KiB/8.9 KiB (47.7 KiB/s) with 1 file(s) remaining\rupload: ../tmp/flyte2qtcr_i0/local_flytekit/f4484561b67506b8c8bfe18c1b4a315d to s3://my-flyte/2x/axd9n8qhbwlgszfb9slf-n0-0/939989a9a40c75f65c631bdf11070354/f4484561b67506b8c8bfe18c1b4a315d\n'
Entering timed context: Writing (/tmp/flyte2qtcr_i0/local_flytekit/8016c26332f3085b27a3d08692b5df94 -> s3://my-flyte/2x/axd9n8qhbwlgszfb9slf-n0-0/ebb807108d3cd57bbfff58b978a8fa22/8016c26332f3085b27a3d08692b5df94)
@Ketan (kumare3) where would you like me to add the documentation?
@Kevin Su any idea what’s causing the slowdown?
k
Might be a Flyte Deck issue
Did you disable it? @task(disable_deck=True….
r
let me try that.. I saw that on another thread but wasn’t sure if it applied to my use case
ok.. still super slow. This runs faster on my desktop as well as laptop 😞
k
btw, is the fsspec package installed in your image? There are some issues in fsspec we just found: it will download the files serially
r
yes.. I do have fsspec
k
sorry, could you remove it and then try to run the workflow again? I’m going to fix the bug in fsspec
r
I’ll remove it for now in my branch.. since this specific workflow doesn’t need it..
k
thanks
r
@Kevin Su it looks like the bottleneck is writing to s3 when the task returns.. at this point it’s just 1000 rows
k
hmm this is really slow
@Kevin Su would be great to dig into this
r
15 min and counting….
k
does not make sense
k
I’m looking into it
r
I’ll be happy to share the code of the task (nothing proprietary in there).. if you want to dig in
k
Yes, that helps
r
Copy code
from time import perf_counter
from typing import Any, Dict, List

import awswrangler as wr
import boto3
import pandas as pd

import flytekit
from flytekit import Resources, Secret, task

# SECRET_GROUP and AWS_REGION are constants defined elsewhere in this module

@task(
    #cache=True,
    #cache_version="1.0",
    limits=Resources(mem="10Gi"),
    disable_deck=True,
    secret_requests=[
        Secret(group=SECRET_GROUP, key="aws_secret_access_key"),
        Secret(group=SECRET_GROUP, key="aws_access_key_id"),
    ],
)
def read_input_records(input_path: str, num_requests: int) -> List[Dict[str, Any]]:
    aws_access_key_id = flytekit.current_context().secrets.get(
        SECRET_GROUP, "aws_access_key_id"
    )
    aws_secret_access_key = flytekit.current_context().secrets.get(
        SECRET_GROUP, "aws_secret_access_key"
    )
    session = boto3.Session(
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=AWS_REGION,
    )
    start = perf_counter()
    df = wr.s3.read_parquet(path=input_path, boto3_session=session)
    stop = perf_counter()
    flytekit.current_context().logging.info(
        f"Read {len(df)} records in {stop - start} s"
    )

    if num_requests != -1 and num_requests < len(df):
        df = df[:num_requests]
    stop_2 = perf_counter()
    flytekit.current_context().logging.info(
        f"Truncated to {len(df)} records in {stop_2 - stop} s"
    )

    records = df.to_dict("records")

    stop_3 = perf_counter()
    flytekit.current_context().logging.info(
        f"Converted to records in {stop_3 - stop_2} s"
    )

    return records
Please adjust aws credentials etc as per your setup
k
Has the task completed? IIRC, if you use a large list (10000+) as output, flyte will take a long time (15 mins+) to construct and save a large protobuf list. However, saving a list of size 1000 should be really fast. How many columns are in the dataframe? The record count should be 1000 (list size) * N (number of columns), right? Downloading the file in the pod with awswrangler should take about the same time as it does locally, so the issue probably only shows up when we save the records or construct a large protobuf list
@Rupsha Chaudhuri I found the issue. the problem is that flyte doesn’t recognize Any, so it serializes the data to pickle, which means you end up with thousands of pickle files. In addition, flytekit uploads the files serially, so it becomes very slow. could you change Any to a concrete python type?
FYI, I can reproduce it with this code
Copy code
import typing
from typing import List, Dict, Any

import numpy as np
import pandas as pd
from flytekit import task, Resources, workflow, StructuredDataset


@task(
    limits=Resources(mem="4Gi"),
    disable_deck=True,
)
def create_parquet() -> StructuredDataset:
    df = pd.DataFrame(np.random.choice(['foo','bar','baz'], size=(100000, 3)), columns=["a", "b", "c"])
    df = df.apply(lambda col: col.astype('category'))

    return StructuredDataset(dataframe=df)


@task(
    limits=Resources(mem="4Gi"),
    disable_deck=True,
)
def read_input_records(sd: StructuredDataset) -> List[Dict[str, Any]]:
    df = sd.open(pd.DataFrame).all()
    records = typing.cast(pd.DataFrame, df).to_dict()
    return [records]*1000  # 1000 * 3


@workflow
def wf():
    sd = create_parquet()
    read_input_records(sd=sd)


if __name__ == "__main__":
    wf()
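and the fix on your side is just to swap Any for a concrete type. a sketch of what that could look like (assuming the column values can be cast to strings; adjust to your schema):
Copy code
from typing import Dict, List

import awswrangler as wr
from flytekit import Resources, task


@task(limits=Resources(mem="10Gi"), disable_deck=True)
def read_input_records(input_path: str, num_requests: int) -> List[Dict[str, str]]:
    # credentials/boto3 session elided; same as in the original task
    df = wr.s3.read_parquet(path=input_path)
    if num_requests != -1 and num_requests < len(df):
        df = df[:num_requests]
    # cast values to str so the rows match the concrete annotation;
    # List[Dict[str, str]] maps to Flyte's native collection/map types instead of pickle
    return df.astype(str).to_dict("records")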
r
thanks a lot.. ok.. let me adjust the type
k
Ohh we need to somehow show this
r
I think there was some warning message about using Pickle for typing.Any… but it would be nice to throw this into the docs somewhere as a real warning
b
As for the task_resource limits that were brought up earlier in this thread:
Copy code
task_resource_defaults.yaml: |
    task_resources:
      defaults:
        cpu: 100m
        ephemeral-storage: 1Gi
        gpu: 0
        memory: 200Mi
        storage: 200Mi
      limits:
        cpu: 9
        ephemeral-storage: 28Gi
        gpu: 0
        memory: 18Gi
        storage: 20Gi
The task_resources.defaults refers to the default requests/limits that apply to any pod. The task_resources.limits refers to the upper bound that users can set from their repo. If no request/limit is set on a task, the task_resources.defaults are applied as both.
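Concretely, a hedged reading of the configmap above (task names here are illustrative):
Copy code
from flytekit import Resources, task


# no resources in the decorator: the pod gets
# requests == limits == task_resources.defaults (cpu 100m / memory 200Mi above)
@task
def uses_platform_defaults() -> None:
    ...


# decorator values are honored as long as they stay at or below
# task_resources.limits (memory 18Gi above); asking for more than that cap
# shouldn't be accepted by flyteadmin
@task(limits=Resources(mem="10Gi"))
def overrides_the_default() -> None:
    ...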
k
@Babis Kiosidis Thanks for the explanation.