Hi team, suppose a DataFrame is the output of a Flyte Spark task. Will Flyte copy the full contents of the DataFrame (which may be backed by some huge HDFS files) to S3/MinIO, or just pass it by reference?
If the former, that sounds somewhat inefficient 🤔
freezing-airport-6809
07/27/2023, 3:07 AM
Between tasks it will be stored.
Not necessarily to S3, though; it could be in HDFS.
freezing-airport-6809
07/27/2023, 3:08 AM
The task boundary is fully recoverable
freezing-airport-6809
07/27/2023, 3:08 AM
If you want to manage that yourself, you can simply pass references.
important-laptop-99340
07/27/2023, 3:26 AM
Thank you Ketan for your prompt response!
"The task boundary is fully recoverable"
🤔 Not sure if I got it. I feel like a reference is also recoverable? At least, I find that FlyteSchema is implemented by a reference to S3.
freezing-airport-6809
07/27/2023, 5:43 AM
What I mean is: if you return a spark.dataframe, as in def foo() -> spark.dataframe, Flyte has to persist it to make recovery possible.
freezing-airport-6809
07/27/2023, 5:44 AM
If you return a reference, then Flyte will not try to persist it; it will assume you know what you are doing.
freezing-airport-6809
07/27/2023, 5:44 AM
All data is passed by reference, but the reference has to be created, right?
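
The distinction in this thread can be sketched as a toy model in plain Python. This is not Flyte's implementation or API: Ref, run_task, STORE, and the HDFS path below are all hypothetical names made up for illustration. It only shows the idea that a returned value has to be written somewhere before a reference to it exists, while a returned reference is handed on untouched.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path

# Hypothetical stand-in for the orchestrator's blob store (S3, HDFS, ...).
STORE = Path(tempfile.mkdtemp(prefix="toy_store_"))


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference type: just a URI pointing at stored data."""
    uri: str


def run_task(name, fn):
    """Run a task and return (reference, whether the engine copied data)."""
    out = fn()
    if isinstance(out, Ref):
        # The task already returned a reference: pass it on, copy nothing.
        return out, False
    # The task returned a materialized value: persist it so the task
    # boundary is recoverable, then hand a reference downstream.
    target = STORE / f"{name}.json"
    target.write_text(json.dumps(out))
    return Ref(str(target)), True


def returns_rows():
    # Stands in for a task that returns a DataFrame.
    return [{"id": 1}, {"id": 2}]


def returns_reference():
    # Stands in for a task that returns a path it manages itself.
    # The path is invented for this example.
    return Ref("hdfs://namenode/warehouse/events")


rows_ref, rows_copied = run_task("rows", returns_rows)
path_ref, path_copied = run_task("path", returns_reference)
print(rows_copied, path_copied)  # True False
```

In both cases the downstream task receives a reference. The only difference is who created it and who is responsible for the data behind it.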