# ask-the-community
b
Hi team, suppose a DataFrame is the task output of one Flyte Spark task, will Flyte copy the full contents of the DataFrame (maybe backed by some huge HDFS files) completely to s3/minio, or just copy it by reference? If the former, it sounds somewhat inefficient 🤔
k
Between tasks it will be stored. Not necessarily to s3 - it could be in HDFS
The task boundary is fully recoverable
If you want to manage that - you can simply send references
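A minimal sketch of the "send references" option, assuming a recent flytekit where `StructuredDataset` can be constructed from a `uri` (the bucket path below is a hypothetical placeholder):

```python
from flytekit import task
from flytekit.types.structured import StructuredDataset

# Hypothetical location; the data is assumed to already exist here.
EXISTING_URI = "s3://my-bucket/warehouse/events/"

@task
def produce_reference() -> StructuredDataset:
    # Returning a StructuredDataset built from a uri hands Flyte a pointer to
    # data that already exists, so Flyte records the reference rather than
    # re-uploading the underlying files.
    return StructuredDataset(uri=EXISTING_URI)
```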
b
Thank you Ketan for your prompt response!
The task boundary is fully recoverable
🤔 not sure if i got it. i feel like a `reference` is also recoverable? At least i find `FlyteSchema` is implemented as a `reference to s3`.
k
what i mean is - if you return a `spark.dataframe`, e.g. `def foo() -> spark.dataframe`, Flyte has to persist it, to make it possible to recover
if you return a reference, then flyte will not try to persist it, it will assume you know what you are doing
all data is passed by reference, but the reference has to be created first, right?
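For contrast, a minimal sketch of the "return the DataFrame itself" case described above, assuming flytekitplugins-spark is installed (the spark_conf values are placeholders):

```python
import flytekit
import pyspark.sql
from flytekit import task
from flytekitplugins.spark import Spark

@task(task_config=Spark(spark_conf={"spark.executor.instances": "2"}))
def make_df() -> pyspark.sql.DataFrame:
    sess = flytekit.current_context().spark_session
    # Because the declared output type is a DataFrame, Flyte persists it at
    # the task boundary (writing it to the configured blob store, e.g.
    # s3/minio), which is what makes the boundary recoverable downstream.
    return sess.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
```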