Hi team, suppose DataFrame is the task output of one Flyte Spark task, will Flyte copy the full cont...

important-laptop-99340

07/27/2023, 2:07 AM

Hi team, suppose a DataFrame is the output of a Flyte Spark task. Will Flyte copy the DataFrame's full contents (possibly backed by some huge HDFS files) to S3 (MinIO), or just pass it by reference? If the former, it sounds somewhat inefficient 🤔

freezing-airport-6809

07/27/2023, 3:07 AM

Between tasks it will be stored. Not necessarily to S3, though; it could be in HDFS.

freezing-airport-6809

07/27/2023, 3:08 AM

The task boundary is fully recoverable

freezing-airport-6809

07/27/2023, 3:08 AM

If you want to manage that yourself, you can simply send references.

important-laptop-99340

07/27/2023, 3:26 AM

Thank you Ketan for your prompt response!

"The task boundary is fully recoverable"

🤔 Not sure if I got it. I feel like a reference is also recoverable? At least I find FlyteSchema is implemented by a reference to S3.

freezing-airport-6809

07/27/2023, 5:43 AM

What I mean is: if you return a spark.dataframe, e.g. def foo() -> spark.dataframe, Flyte has to persist it to make it possible to recover.

freezing-airport-6809

07/27/2023, 5:44 AM

If you return a reference, then Flyte will not try to persist it; it will assume you know what you are doing.

freezing-airport-6809

07/27/2023, 5:44 AM

All data is passed by reference, but the reference has to be created, right?
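
The distinction described in this thread can be illustrated with a small toy model. This is only a sketch in plain Python and does not use Flyte's API: `DataRef`, `cross_task_boundary`, the JSON file format, and the `hdfs://namenode/...` URI are all hypothetical names chosen for illustration. It assumes that an in-memory output has no durable location yet, so something must write it out and create a reference, while an output that is already a reference is simply forwarded.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataRef:
    """Hypothetical stand-in for a reference to data already in storage."""
    uri: str


def cross_task_boundary(output, store_dir: Path, name: str) -> DataRef:
    """Toy model of a task boundary (not Flyte's actual implementation).

    An in-memory value is written to the store and a new reference is
    created for it. A value that is already a reference is forwarded
    untouched, so no data is copied.
    """
    if isinstance(output, DataRef):
        return output  # the caller manages the data; nothing is persisted
    target = store_dir / f"{name}.json"
    target.write_text(json.dumps(output))  # persist so downstream can recover it
    return DataRef(uri=target.as_uri())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)

        # Case 1: in-memory rows, analogous to returning a DataFrame.
        rows = [{"id": 1}, {"id": 2}]
        persisted = cross_task_boundary(rows, store, "task_a_out")
        print(persisted.uri)

        # Case 2: an existing reference (hypothetical URI), passed through as-is.
        existing = DataRef(uri="hdfs://namenode/warehouse/big_table")
        forwarded = cross_task_boundary(existing, store, "task_b_out")
        print(forwarded.uri)
```

In both cases the downstream task only ever receives a reference; the difference is whether the boundary had to create one by writing the data out first.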
