# ask-the-community
Hi, everyone. I have a naive question: how do I load a file from a GCS bucket inside a spark_task? Is there an idiomatic way to do it in Flyte? (Normally I would add some Maven coordinates to my Spark config, but I noticed that this does not work for Flyte, in a cluster where Spark images have to be pre-built.)
In Flyte a task can use a file type for input/output, and Flyte uses S3/GCS for persistent storage. If the file is returned by a Flyte task, you can pass it to your Spark task as an input and you don't need to deal with anything. If the file is not generated by Flyte, then you pass the file path URL to the task, and the task uses a GCS client to download the file. https://docs.flyte.org/projects/cookbook/en/latest/auto/core/flyte_basics/files.html
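For the second case (a file not produced by Flyte), a minimal sketch of fetching an object with the `google-cloud-storage` client might look like this. The function name and bucket/blob arguments are hypothetical, not from the thread:

```python
def download_from_gcs(bucket_name: str, blob_name: str, local_path: str) -> str:
    """Fetch one object from GCS to a local file and return the local path.

    Hypothetical helper; requires the google-cloud-storage package and
    GCS credentials available in the task's environment.
    """
    from google.cloud import storage  # imported lazily; needs google-cloud-storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.download_to_filename(local_path)
    return local_path
```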
cc: @Samhita Alla
so if I understand it correctly, I should be able to read a gcs file in flyte like this:
```python
import flytekit
from flytekit import task, workflow
from flytekit.types.file import FlyteFile
from flytekitplugins.spark import Spark

@task
def get_file(url: str) -> FlyteFile:
    return FlyteFile(url)

@task(task_config=Spark())
def spark_task(file: FlyteFile):
    sess = flytekit.current_context().spark_session
    df = sess.read.text(file)

@workflow
def pipeline(url: str = "gs://xxxx"):
    file = get_file(url=url)
    spark_task(file=file)
```
Is this the right way to do it?
You don’t need any logic for s3/gcs
Do I need one more step, FlyteFile(url).download()?
no 😁
That sounds great! Thx a lot.
I tried with the above approach, but got this error in `spark_task`:
`'FlyteFile' object has no attribute '_get_object_id'`
sorry, my bad. You need
@Samhita Alla, can you confirm the logic?
Oh yes.
is required.
So the handle returned from
should be fed into
You do not need file.download(); for Spark you want to hand over the remote path to the executors
Each file input should have an attribute `.remote_path`
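In other words, inside the Spark task you pass the remote URI to Spark instead of downloading the file. A small sketch, where the helper name is hypothetical and `.remote_path` is the attribute named above:

```python
from typing import Any

def read_text_in_spark(sess: Any, file: Any):
    """Hand a FlyteFile's remote URI to Spark so the executors read
    it from GCS directly, rather than downloading it to the driver.

    `sess` would be flytekit.current_context().spark_session and
    `file` the FlyteFile task input; `.remote_path` is assumed to
    hold its gs:// URI, per the discussion above.
    """
    return sess.read.text(file.remote_path)
```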
@Tiansu Yu
You can of course download, but in Spark that would be weird.
cc @Yee / @Samhita Alla: we should probably create a derivative type called remote file that is never downloaded, but is only a remote path
Yes, but let me be clear here: what do you mean by a remote_path, if not the original URL? And if you don't have a GCS connector on your executor, how does Spark read it anyway? So I don't get the point of wrapping my URL inside a FlyteFile and then returning it back again.
@Tiansu Yu I can probably help - but I am in PDT, so in 2 ish hours
FlyteFile's advantage is automatic upload; it is mostly meant for non-Spark use.
StructuredDataset, on the other hand, is handled for Spark.
But point noted, we should support file on Spark too.
So FlyteFile is like a persistent file interface: when you work with it, it is materialized locally, and when you return it, it is automatically uploaded/sent. You don't have to use it; you can use strings.
Thx for these points.
Definitely read the docs on both
One benefit of FlyteFile is native understanding by the UI/backend/flytectl etc.