Nada Saiyed

06/27/2022, 9:11 PM
hello! i am trying to run a workflow which has a task/step that fetches data from sandbox’s Minio and splits it into train/test. But the pod keeps dying with this error message
Pod failed. No message received from kubernetes.
[atg6lnllwptbr4j6thwc-n0-0] terminated with exit code (137). Reason [Error]. Message:
the flyte cluster is a local sandbox cluster

katrina

06/27/2022, 11:02 PM
could this be an OOM? do you see anything in the console for your execution? can you try configuring the task with resources and bumping the memory from the default?
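
For context on why exit code 137 points toward an OOM: by shell convention, exit codes above 128 mean the process was terminated by signal number `code - 128`, so 137 corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends. SIGKILL can come from other sources too, so this is a hint rather than proof. A minimal sketch of that decoding, where `describe_exit_code` is a hypothetical helper and not part of Flyte or Kubernetes:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name on the current platform (e.g. SIGKILL).
            name = signal.Signals(signum).name
        except ValueError:
            # The platform has no named signal with this number.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```

On Linux this reports a kill by SIGKILL for 137. Checking the `Last State` / `Reason` fields in `kubectl describe po` for `OOMKilled` is the more direct confirmation.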

Nada Saiyed

06/28/2022, 2:03 AM
yes, i did try maxing out the limits and requests for the failing task. Description of failed pod below.
Limits:
  cpu:     2
  memory:  1Gi
Requests:
  cpu:     1
  memory:  1Gi
also is `1Gi` the max limit i can set?

Samhita Alla

06/28/2022, 4:17 AM
@Nada Saiyed, is that all you see in the pod log? Have you tried `kubectl describe po <pod-name> -n flytesnacks-development`?

Nada Saiyed

06/28/2022, 4:40 AM
yes. i have tried describing the pod, the description does not give any specific detail as to why the pod might be crashing 😕

katrina

06/28/2022, 3:35 PM
is there anything in the console that indicates the failure reason? we do our best to interpret the pod status
can you share the task decorator with the increased resource requests and limits?

Nada Saiyed

06/28/2022, 3:37 PM
@task(requests=Resources(cpu="1", mem="900Mi"), limits=Resources(mem="1G", cpu="2"))
and no.. i dont see anything in the console that indicates failure reason besides the Failure msg i posted at the start

katrina

06/28/2022, 3:40 PM
how much data is the task fetching? maybe 1Gi isn't enough?

Nada Saiyed

06/28/2022, 3:41 PM
hardly 40mb
how can i bump it to more than 1G?

katrina

06/28/2022, 3:42 PM
you can update the limit in the config I shared above and then re-register the task with an increased limit in the decorator
but the pod shouldn't be OOMing really if it's 40mb. can you share your task definition?

Nada Saiyed

06/28/2022, 3:43 PM
in the config.. storage is set to `20Mi`. that seems low
"task_resource_defaults.yaml": "task_resources:
		  defaults:
		    cpu: 100m
		    memory: 500Mi
		    storage: 500Mi
		  limits:
		    cpu: 2
		    gpu: 1
		    memory: 1Gi
		    storage: 20Mi
		"
where can i get task definition from?
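
The inconsistency spotted in the config above (a `storage` default of `500Mi` against a `storage` limit of `20Mi`) can be checked mechanically. The following is a rough sketch, assuming only a subset of Kubernetes quantity suffixes; `parse_quantity` and `defaults_above_limits` are hypothetical helper names, not Flyte or Kubernetes APIs. It also shows that `1G` (10^9 bytes) is slightly smaller than `1Gi` (2^30 bytes), which is why `mem="1G"` in the decorator fits under the `1Gi` platform limit.

```python
from decimal import Decimal

# Multipliers for a subset of Kubernetes quantity suffixes (enough for this config).
_SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "T": Decimal(10) ** 12,
    "m": Decimal(1) / 1000,  # milli, as in cpu: 100m
}


def parse_quantity(text) -> Decimal:
    """Convert a quantity such as '500Mi', '100m' or '2' to a plain number."""
    text = str(text).strip()
    # Try two-character suffixes first so 'Mi' is not mistaken for another unit.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def defaults_above_limits(defaults: dict, limits: dict) -> list:
    """Return the resource names whose default exceeds the configured limit."""
    return sorted(
        name
        for name, value in defaults.items()
        if name in limits and parse_quantity(value) > parse_quantity(limits[name])
    )


# Values copied from the task_resource_defaults.yaml shown in the thread.
defaults = {"cpu": "100m", "memory": "500Mi", "storage": "500Mi"}
limits = {"cpu": "2", "gpu": "1", "memory": "1Gi", "storage": "20Mi"}
print(defaults_above_limits(defaults, limits))  # ['storage']
```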

katrina

06/28/2022, 3:46 PM
I mean the task definition you write in flytekit aka your python code :)
yeah the limit should not be less than the default... we should fix this. if you'd like to open a PR that would be awesome too 😄

Nada Saiyed

06/28/2022, 3:48 PM
oh lol. yeah sure
@task(requests=Resources(cpu="1", mem="900Mi"), limits=Resources(mem="1G", cpu="2"))
def split_traintest_dataset(
    dataset: FlyteFile[typing.TypeVar("parquet")], seed: int, test_split_ratio: float
) -> Tuple[
    FlyteSchema[NYC_FEATURE_COLUMNS],
    FlyteSchema[NYC_FEATURE_COLUMNS],
    FlyteSchema[NYC_CLASSES_COLUMNS],
    FlyteSchema[NYC_CLASSES_COLUMNS],
]:
    """
    Retrieves the training dataset from the given blob location and then splits it using the split ratio and returns the result
    """
    
    column_names = [k for k in NYC_DATASET_COLUMNS.keys()]
    try:
        df = pd.read_parquet("workflows/yellow_tripdata_2022-01.parquet")
        clean_df = preprocess(df)
    except Exception as err:
        print(err)
    # Select all features
    x = clean_df[column_names[:-1]]
    # Select only the classes
    y = clean_df[[column_names[-1]]]

    # split data into train and test sets
    return train_test_split(x, y, test_size=test_split_ratio, random_state=seed)

katrina

06/28/2022, 3:57 PM
there's nothing funky going on in preprocess that could possibly balloon the task's memory footprint?

Nada Saiyed

06/28/2022, 4:07 PM
it does not even reach till the preprocess step… and in the preprocess step its just removing nulls and selecting a set of columns..

Yee

06/28/2022, 5:16 PM
can you increase memory?
i don’t know how big that file is but my 2010 file is pretty big
-rw-r--r--@ 1 ytong  staff   2.5G Mar 24  2021 yellow_tripdata_2010-01.csv
the entire file needs to be loaded into memory before anything is done. flytekit is not smart enough yet to know how to stream-process from disk.
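
To illustrate the difference between loading a whole file and streaming it, here is a small stdlib-only sketch that compares peak Python-heap usage for the two approaches. It is a generic illustration under stated assumptions, not a description of flytekit internals, and all function names are hypothetical. Note also that a compressed Parquet file usually expands when decoded into a DataFrame, so the in-memory size can be well above the on-disk size.

```python
import os
import tempfile
import tracemalloc


def make_sample_file(size_bytes: int) -> str:
    """Write a temporary file of roughly size_bytes made of 1 KiB lines."""
    line = b"x" * 1023 + b"\n"
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for _ in range(size_bytes // len(line)):
            f.write(line)
    return path


def count_newlines_whole(path: str) -> int:
    """Load the entire file into memory, then process it."""
    with open(path, "rb") as f:
        data = f.read()
    return data.count(b"\n")


def count_newlines_chunked(path: str, chunk_size: int = 64 * 1024) -> int:
    """Process the file in fixed-size chunks so only a small piece is resident."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += chunk.count(b"\n")
    return total


def peak_bytes(fn, *args) -> int:
    """Run fn(*args) and return the peak traced Python allocation in bytes."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Calling `peak_bytes(count_newlines_whole, path)` on a file of a few MiB should report a peak at least as large as the file, while the chunked variant stays near the chunk size.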

Nada Saiyed

06/28/2022, 5:32 PM
right.. but even on increasing it to this
@task(requests=Resources(cpu="1", mem="2G", storage="500Mi"), limits=Resources(mem="5G", cpu="2", storage="500Mi"))
the pod crashes.. and i am using a single file `yellow_tripdata_2022-01.parquet` which is ~35MB

Samhita Alla

06/30/2022, 5:03 AM
Have you figured this out, or is the code not working yet? @Nada Saiyed If you’re still seeing the issue, could you increase the storage mem and also set `ephemeral_storage` to say, `500Mi`? Not sure if that’d solve your problem, though.

Nada Saiyed

06/30/2022, 3:11 PM
i was able to solve the issue with Yee’s help. thanks