hello i am trying to run a workflow which has a task step th Flyte #flyte-support

plain-carpenter-67621

06/27/2022, 9:11 PM

hello! i am trying to run a workflow which has a task/step that fetches data from the sandbox’s MinIO and splits it into train/test. But the pod keeps dying with this error message:

Pod failed. No message received from kubernetes.
[atg6lnllwptbr4j6thwc-n0-0] terminated with exit code (137). Reason [Error]. Message: 

.

the flyte cluster is a local sandbox cluster

acceptable-policeman-57188

06/27/2022, 11:02 PM

could this be an OOM? do you see anything in the console for your execution? can you try configuring the task with resources and bumping the memory from the default?
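For context on why exit code 137 points toward an OOM: exit codes above 128 conventionally mean the process was terminated by signal number (code - 128), and 137 - 128 = 9 is SIGKILL. A minimal sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and not part of Flyte or Kubernetes. SIGKILL is what the kernel OOM killer sends, but a pod can also be killed this way for other reasons (for example eviction), so 137 is consistent with an OOM rather than proof of one.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a container exit code (hypothetical helper).

    By convention, a code above 128 means "killed by signal (code - 128)".
    """
    if code > 128:
        return f"killed by {signal.Signals(code - 128).name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
```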

plain-carpenter-67621

06/28/2022, 2:03 AM

yes, i did try maxing out the limits and requests for the failing task. Description of failed pod below.

    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi

plain-carpenter-67621

06/28/2022, 2:04 AM

also is 1Gi the max limit i can set?

tall-lock-23197

06/28/2022, 4:17 AM

@plain-carpenter-67621, is that all you see in the pod log? Have you tried kubectl describe po <pod-name> -n flytesnacks-development?

plain-carpenter-67621

06/28/2022, 4:40 AM

yes. i have tried describing the pod, the description does not give any specific detail as to why the pod might be crashing 😕

acceptable-policeman-57188

06/28/2022, 3:35 PM

is there anything in the console that indicates the failure reason? we do our best to interpret the pod status

acceptable-policeman-57188

06/28/2022, 3:36 PM

can you share the task decorator with the increased resource requests and limits?

plain-carpenter-67621

06/28/2022, 3:37 PM

@task(requests=Resources(cpu="1", mem="900Mi"), limits=Resources(mem="1G", cpu="2"))
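A side note on the units in this decorator, assuming the strings are interpreted as Kubernetes quantities: "G" is a decimal suffix (10^9 bytes) while "Gi" and "Mi" are binary suffixes (2^30 and 2^20 bytes). The rough arithmetic below suggests that a "1G" limit sits only about 54 MiB above a "900Mi" request, and below the 1Gi cap. The variable names are illustrative only.

```python
MIB = 2**20  # bytes in one mebibyte ("Mi")
GIB = 2**30  # bytes in one gibibyte ("Gi")

request_bytes = 900 * MIB  # mem="900Mi"
limit_bytes = 1 * 10**9    # mem="1G" (decimal gigabyte)

# The limit expressed in MiB, and the headroom above the request.
limit_mib = limit_bytes / MIB
headroom_mib = (limit_bytes - request_bytes) / MIB

print(f"limit = {limit_mib:.1f} MiB, headroom = {headroom_mib:.1f} MiB")
print("limit below 1Gi:", limit_bytes < GIB)
```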

plain-carpenter-67621

06/28/2022, 3:37 PM

and no.. i dont see anything in the console that indicates failure reason besides the Failure msg i posted at the start

acceptable-policeman-57188

06/28/2022, 3:40 PM

how much data is the task fetching? maybe 1Gi isn't enough?

plain-carpenter-67621

06/28/2022, 3:41 PM

hardly 40mb

plain-carpenter-67621

06/28/2022, 3:41 PM

how can i bump it to more than 1G?

acceptable-policeman-57188

06/28/2022, 3:42 PM

you can update the limit in the config I shared above and then re-register the task with an increased limit in the decorator

acceptable-policeman-57188

06/28/2022, 3:42 PM

but the pod shouldn't be OOMing really if it's 40mb. can you share your task definition?

plain-carpenter-67621

06/28/2022, 3:43 PM

in the config.. storage is set to 20Mi, that seems low

"task_resource_defaults.yaml": "task_resources:
		  defaults:
		    cpu: 100m
		    memory: 500Mi
		    storage: 500Mi
		  limits:
		    cpu: 2
		    gpu: 1
		    memory: 1Gi
		    storage: 20Mi
		"

plain-carpenter-67621

06/28/2022, 3:45 PM

where can i get the task definition from?

acceptable-policeman-57188

06/28/2022, 3:46 PM

I mean the task definition you write in flytekit aka your python code :)

acceptable-policeman-57188

06/28/2022, 3:46 PM

yeah the limit should not be less than the default... we should fix this. if you'd like to open a PR that would be awesome too 😄

👍 1
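The mismatch discussed here (a storage limit of 20Mi under a storage default of 500Mi) can be caught mechanically. Below is a minimal sketch, not Flyte's own validation: `parse_quantity` and `limits_below_defaults` are hypothetical helpers, the parser handles only the common Kubernetes-style suffixes, and the two dictionaries are copied from the config pasted above.

```python
from decimal import Decimal

# Multipliers for common Kubernetes-style quantity suffixes.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(value) -> Decimal:
    """Convert a quantity such as '500Mi', '100m' or 2 into a plain number."""
    text = str(value).strip()
    for suffix, multiplier in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * multiplier
    if text.endswith("m"):  # milli-units, e.g. '100m' CPU = 0.1 core
        return Decimal(text[:-1]) / 1000
    return Decimal(text)


def limits_below_defaults(defaults: dict, limits: dict) -> list:
    """Return the resource names whose limit is smaller than the default."""
    return sorted(
        name
        for name in defaults
        if name in limits
        and parse_quantity(limits[name]) < parse_quantity(defaults[name])
    )


# Values taken from the task_resource_defaults.yaml pasted in this thread.
defaults = {"cpu": "100m", "memory": "500Mi", "storage": "500Mi"}
limits = {"cpu": 2, "gpu": 1, "memory": "1Gi", "storage": "20Mi"}

violations = limits_below_defaults(defaults, limits)
print(violations)  # ['storage']
```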

plain-carpenter-67621

06/28/2022, 3:48 PM

oh lol. yeah sure

@task(requests=Resources(cpu="1", mem="900Mi"), limits=Resources(mem="1G", cpu="2"))
def split_traintest_dataset(
    dataset: FlyteFile[typing.TypeVar("parquet")], seed: int, test_split_ratio: float
) -> Tuple[
    FlyteSchema[NYC_FEATURE_COLUMNS],
    FlyteSchema[NYC_FEATURE_COLUMNS],
    FlyteSchema[NYC_CLASSES_COLUMNS],
    FlyteSchema[NYC_CLASSES_COLUMNS],
]:
    """
    Retrieves the training dataset from the given blob location and then splits it using the split ratio and returns the result
    """
    
    column_names = [k for k in NYC_DATASET_COLUMNS.keys()]
    try:
        df = pd.read_parquet("workflows/yellow_tripdata_2022-01.parquet")
        clean_df = preprocess(df)
    except Exception as err:
        print(err)
    # Select all features
    x = clean_df[column_names[:-1]]
    # Select only the classes
    y = clean_df[[column_names[-1]]]

    # split data into train and test sets
    return train_test_split(x, y, test_size=test_split_ratio, random_state=seed)

acceptable-policeman-57188

06/28/2022, 3:57 PM

there's nothing funky going on in preprocess that could possibly balloon the task's memory footprint?

plain-carpenter-67621

06/28/2022, 4:07 PM

it does not even reach the preprocess step… and the preprocess step is just removing nulls and selecting a set of columns..

thankful-minister-83577

06/28/2022, 5:16 PM

can you increase memory?

thankful-minister-83577

06/28/2022, 5:17 PM

i don’t know how big that file is but my 2010 file is pretty big

-rw-r--r--@ 1 ytong  staff   2.5G Mar 24  2021 yellow_tripdata_2010-01.csv

thankful-minister-83577

06/28/2022, 5:18 PM

the entire file needs to be loaded into memory before anything is done. flytekit is not smart enough yet to know how to stream-process from disk.
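Because the whole file ends up in memory, and a compressed Parquet file usually takes more space once decoded than it does on disk, it can help to measure the real footprint by running the same load locally. A rough sketch using only the standard library (Unix only); `peak_rss_mib` is a hypothetical helper and the `bytearray` is just a stand-in for the actual read and preprocess steps.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB (hypothetical helper)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 2**20 if sys.platform == "darwin" else 2**10
    return peak / divisor


before = peak_rss_mib()
payload = bytearray(50 * 2**20)  # stand-in for loading the dataset
after = peak_rss_mib()

print(f"peak RSS before: {before:.0f} MiB, after: {after:.0f} MiB")
```

Comparing the reported peak against the task's memory limit gives a first indication of whether the limit is plausible for the data being loaded.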

plain-carpenter-67621

06/28/2022, 5:32 PM

right.. but even on increasing it to this

@task(requests=Resources(cpu="1", mem="2G", storage="500Mi"), limits=Resources(mem="5G", cpu="2", storage="500Mi"))

the pod crashes.. and i am using a single file, yellow_tripdata_2022-01.parquet, which is ~35MB

tall-lock-23197

06/30/2022, 5:03 AM

Have you figured this out, or is the code not working yet? @plain-carpenter-67621 If you’re still seeing the issue, could you increase the storage mem and also set ephemeral_storage to say, 500Mi? Not sure if that’d solve your problem, though.

plain-carpenter-67621

06/30/2022, 3:11 PM

i was able to solve the issue with Yee’s help. thanks
