# ask-the-community
b
Hi team, suppose a DataFrame is the task output of one Flyte Spark task, will Flyte copy the full contents of the DataFrame (maybe backed by some huge HDFS files) completely to s3/minio, or just copy it by reference? If the former, it sounds somewhat inefficient 🤔
k
Between tasks it will be stored. Not necessarily to s3 - it could be in HDFS
The task boundary is fully recoverable
If you want to manage that - you can simply send references
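A minimal sketch of the "send references" option, assuming a recent flytekit where `StructuredDataset` can be constructed from a `uri` (the bucket path below is a hypothetical placeholder):

```python
from flytekit import task
from flytekit.types.structured import StructuredDataset

# Hypothetical location; the data is assumed to already exist here.
EXISTING_URI = "s3://my-bucket/warehouse/events/"

@task
def produce_reference() -> StructuredDataset:
    # Returning a StructuredDataset built from a uri hands Flyte a pointer to
    # data that already exists, so Flyte records the reference rather than
    # re-uploading the underlying files.
    return StructuredDataset(uri=EXISTING_URI)
```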
b
Thank you Ketan for your prompt response!
The task boundary is fully recoverable
🤔 not sure if i got it. i feel like a `reference` is also recoverable? At least i find `FlyteSchema` is implemented as a `reference to s3`.
k
what i mean is - if you return a `spark.dataframe`, e.g. `def foo() -> spark.dataframe`, Flyte has to persist it, to make it possible to recover
if you return a reference, then flyte will not try to persist it, it will assume you know what you are doing
all data is passed by reference, but the reference has to be created first, right?
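For contrast, a minimal sketch of the "return the DataFrame itself" case described above, assuming flytekitplugins-spark is installed (the spark_conf values are placeholders):

```python
import flytekit
import pyspark.sql
from flytekit import task
from flytekitplugins.spark import Spark

@task(task_config=Spark(spark_conf={"spark.executor.instances": "2"}))
def make_df() -> pyspark.sql.DataFrame:
    sess = flytekit.current_context().spark_session
    # Because the declared output type is a DataFrame, Flyte persists it at
    # the task boundary (writing it to the configured blob store, e.g.
    # s3/minio), which is what makes the boundary recoverable downstream.
    return sess.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
```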