# ask-the-community
n
hello! I am trying to run a workflow with a task/step that fetches data from the sandbox's MinIO and splits it into train/test sets, but the pod keeps dying with this error message:
```
Pod failed. No message received from kubernetes.
[atg6lnllwptbr4j6thwc-n0-0] terminated with exit code (137). Reason [Error]. Message:
```
the Flyte cluster is a local sandbox cluster
k
could this be an OOM? do you see anything in the console for your execution? can you try configuring the task with resources and bumping the memory from the default?
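An aside on why OOM is the first guess here: Kubernetes reports a container killed by signal N as exit code 128 + N, and the kernel OOM killer sends SIGKILL (signal 9), so exit code 137 is the usual fingerprint of an out-of-memory kill. A quick stdlib check:

```python
import signal

# Kubernetes surfaces "killed by signal N" as exit code 128 + N.
# The kernel OOM killer sends SIGKILL (signal 9), so an exit code of
# 137 = 128 + 9 usually means the container was OOM-killed.
oom_exit_code = 128 + int(signal.SIGKILL)
print(oom_exit_code)  # 137
```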
n
yes, i did try maxing out the limits and requests for the failing task. Description of failed pod below.
```
Limits:
  cpu:     2
  memory:  1Gi
Requests:
  cpu:     1
  memory:  1Gi
```
also, is `1Gi` the max limit I can set?
s
@Nada Saiyed, is that all you see in the pod log? Have you tried `kubectl describe po <pod-name> -n flytesnacks-development`?
n
yes, I have tried describing the pod; the description does not give any specific detail as to why the pod might be crashing 😕
k
is there anything in the console that indicates the failure reason? we do our best to interpret the pod status
can you share the task decorator with the increased resource requests and limits?
n
```python
@task(requests=Resources(cpu="1", mem="900Mi"), limits=Resources(mem="1G", cpu="2"))
```
and no, I don't see anything in the console that indicates a failure reason besides the failure message I posted at the start
k
how much data is the task fetching? maybe 1Gi isn't enough?
n
hardly 40 MB
how can I bump it to more than 1G?
k
you can update the limit in the config I shared above and then re-register the task with an increased limit in the decorator
but the pod really shouldn't be OOMing if it's 40 MB. can you share your task definition?
n
in the config, storage is set to `20Mi`; that seems low
```yaml
# task_resource_defaults.yaml
task_resources:
  defaults:
    cpu: 100m
    memory: 500Mi
    storage: 500Mi
  limits:
    cpu: 2
    gpu: 1
    memory: 1Gi
    storage: 20Mi
```
where can I get the task definition from?
k
I mean the task definition you write in flytekit, aka your Python code :)
yeah, the limit should not be less than the default... we should fix this. if you'd like to open a PR that would be awesome too 😄
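A minimal sketch of the fix being discussed, assuming the storage limit is simply raised to at least match the default (the value the maintainers actually chose may differ):

```yaml
task_resources:
  defaults:
    storage: 500Mi
  limits:
    storage: 500Mi   # was 20Mi; a limit below the default is inconsistent
```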
n
oh lol. yeah sure
```python
@task(requests=Resources(cpu="1", mem="900Mi"), limits=Resources(mem="1G", cpu="2"))
def split_traintest_dataset(
    dataset: FlyteFile[typing.TypeVar("parquet")], seed: int, test_split_ratio: float
) -> Tuple[
    FlyteSchema[NYC_FEATURE_COLUMNS],
    FlyteSchema[NYC_FEATURE_COLUMNS],
    FlyteSchema[NYC_CLASSES_COLUMNS],
    FlyteSchema[NYC_CLASSES_COLUMNS],
]:
    """
    Retrieves the training dataset from the given blob location and then splits it using the split ratio and returns the result
    """
    column_names = [k for k in NYC_DATASET_COLUMNS.keys()]
    try:
        df = pd.read_parquet("workflows/yellow_tripdata_2022-01.parquet")
        clean_df = preprocess(df)
    except Exception as err:
        print(err)
    # Select all features
    x = clean_df[column_names[:-1]]
    # Select only the classes
    y = clean_df[[column_names[-1]]]

    # split data into train and test sets
    return train_test_split(x, y, test_size=test_split_ratio, random_state=seed)
```
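For reference, `train_test_split` here returns four frames (features and labels, each split into train and test) in one call. A stdlib-only sketch of the same ratio-based split semantics, using a hypothetical `split_train_test` helper (for illustration only, not the Flyte code above):

```python
import random

def split_train_test(rows, test_split_ratio, seed):
    # Shuffle deterministically with the given seed, then slice off the
    # test fraction, mirroring what sklearn's train_test_split does
    # with test_size and random_state.
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_split_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

train, test = split_train_test(list(range(100)), test_split_ratio=0.2, seed=7)
print(len(train), len(test))  # 80 20
```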
k
there's nothing funky going on in preprocess that could possibly balloon the task's memory footprint?
n
it does not even reach the preprocess step… and in the preprocess step it's just removing nulls and selecting a set of columns
y
can you increase memory?
i don't know how big that file is, but my 2010 file is pretty big
```
-rw-r--r--@ 1 ytong  staff   2.5G Mar 24  2021 yellow_tripdata_2010-01.csv
```
the entire file needs to be loaded into memory before anything is done. flytekit is not smart enough yet to know how to stream-process from disk.
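A side note on why a small file can still blow past the memory limit: columnar formats like Parquet compress repetitive data heavily, so the decoded in-memory DataFrame can be many times larger than the file on disk. A stdlib illustration of the same effect using gzip (not Parquet itself):

```python
import gzip

# Repetitive tabular data compresses extremely well, so the on-disk
# size can understate the decoded in-memory size by a large factor.
raw = b"taxi,fare,tip\n" * 100_000      # ~1.4 MB once decoded
compressed = gzip.compress(raw)         # only a few KB "on disk"
print(len(raw) // len(compressed))      # expansion factor, well above 1
```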
n
right.. but even after increasing it to this
```python
@task(requests=Resources(cpu="1", mem="2G", storage="500Mi"), limits=Resources(mem="5G", cpu="2", storage="500Mi"))
```
the pod still crashes, and I am using a single file, `yellow_tripdata_2022-01.parquet`, which is ~35 MB
s
Have you figured this out, or is the code not working yet? @Nada Saiyed If you're still seeing the issue, could you increase the storage mem and also set `ephemeral_storage` to, say, `500Mi`? Not sure if that'd solve your problem, though.
n
I was able to solve the issue with Yee's help, thanks!