# ask-the-community
t
Hi, everyone. I have a naive question: how do I load a file from a GCS bucket inside a spark_task? Is there an idiomatic way to do it in Flyte? (Normally I would add some Maven coordinates to my Spark config, but I noticed that this does not work for Flyte in a cluster where the Spark images have to be pre-built.)
y
In Flyte, a task can use a file type for input/output, and Flyte uses S3/GCS for persistent storage. If the file is returned by another Flyte task, you can pass it to your Spark task as an input and you don't need to deal with anything. If the file is not produced by Flyte, then you need to pass the file path URL into the task, and the task will use a GCS client to download the file. https://docs.flyte.org/projects/cookbook/en/latest/auto/core/flyte_basics/files.html
cc: @Samhita Alla
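A minimal sketch of that second case (a file not produced by Flyte), assuming the google-cloud-storage package is installed in the task image and the task has read access to the bucket; fetch_gcs_file and the /tmp destination are illustrative names, not Flyte APIs:

```python
from google.cloud import storage
from flytekit import task
from flytekit.types.file import FlyteFile


@task
def fetch_gcs_file(url: str) -> FlyteFile:
    # Split "gs://bucket/path/to/object" into bucket and blob names.
    bucket_name, _, blob_name = url.removeprefix("gs://").partition("/")
    local_path = "/tmp/input_file"  # hypothetical local destination
    storage.Client().bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
    # Returning a FlyteFile lets Flyte re-upload it to its own blob store
    # and pass it to downstream tasks.
    return FlyteFile(local_path)
```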
t
so if I understand it correctly, I should be able to read a gcs file in flyte like this:
```python
import flytekit
from flytekit import task, workflow
from flytekit.types.file import FlyteFile
from flytekitplugins.spark import Spark


@task
def get_file(url: str) -> FlyteFile:
    return FlyteFile(url)


@task(task_config=Spark(...))
def spark_task(file: FlyteFile):
    spark = flytekit.current_context().spark_session
    df = spark.read.text(file)


@workflow
def pipeline(url: str = "gs://xxxx"):
    file = get_file(url=url)
    spark_task(file=file)
```
Is this the right way to do it?
y
yes
You don't need any extra logic for S3/GCS.
t
Do I need to add one more step, FlyteFile(url).download()?
y
no 😁
t
That sounds great! Thx a lot.
I tried with the above approach, but got this error in `spark_task`:
'FlyteFile' object has no attribute '_get_object_id'
y
Sorry, my bad. You need file.download(). @Samhita Alla can you confirm the logic?
s
Oh yes, file.download() is required.
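A sketch of that fix applied to the earlier example, assuming the same task signature; the Spark conf is a placeholder, and note that download() materializes the file on the driver, so the local path must also be readable by the executors (e.g. local mode):

```python
import flytekit
from flytekit import task
from flytekit.types.file import FlyteFile
from flytekitplugins.spark import Spark


@task(task_config=Spark(spark_conf={"spark.driver.memory": "1g"}))  # placeholder conf
def spark_task(file: FlyteFile):
    # download() pulls the blob to local disk on the driver and returns that path.
    local_path = file.download()
    spark = flytekit.current_context().spark_session
    df = spark.read.text(local_path)
    print(df.count())
```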
t
So the path returned from file.download() should be fed into spark.read.text(...), right?
k
You do not need file.download() for Spark; you want to hand over the remote path to the executors.
Each file input should have a .remote_path attribute.
@Tiansu Yu
You can of course download, but with Spark that would be odd.
cc @Yee / @Samhita Alla we should probably create a derivative type called RemoteFile that is never downloaded, only ever a remote path.
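A minimal sketch of handing over the remote path instead, assuming the pre-built Spark image ships the GCS Hadoop connector so the executors can read gs:// URIs directly (the conf is a placeholder):

```python
import flytekit
from flytekit import task
from flytekit.types.file import FlyteFile
from flytekitplugins.spark import Spark


@task(task_config=Spark(spark_conf={"spark.executor.instances": "2"}))  # placeholder conf
def spark_task(file: FlyteFile):
    spark = flytekit.current_context().spark_session
    # Pass the original gs:// URI straight to Spark; each executor reads it
    # itself, so nothing is downloaded on the driver.
    df = spark.read.text(file.remote_path)
    print(df.count())
```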
t
Yes, but let me be clear here: what do you mean by a remote_path, if not the original URL? And if you don't have a GCS connector on your executors, how does Spark read it anyway? So I don't get the point of wrapping my URL inside a FlyteFile and then returning it back again.
k
@Tiansu Yu I can probably help - but I am in PDT, so in 2 ish hours
FlyteFile's advantage is automatic upload; it's mostly meant for non-Spark use.
StructuredDataset, on the other hand, is handled natively for Spark.
But point noted, we should support FlyteFile on Spark too.
So FlyteFile is like a persistent file interface: when you work with it, it is materialized locally, and when you return it, it is automatically uploaded. You don't have to use it; you can use strings.
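A sketch of the plain-string alternative, where the gs:// URL is passed straight through the workflow and Spark (again assuming the GCS connector is in the image) resolves it itself; the task name and conf are illustrative:

```python
import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Spark


@task(task_config=Spark(spark_conf={"spark.executor.instances": "2"}))  # placeholder conf
def spark_read_task(url: str):
    spark = flytekit.current_context().spark_session
    # Nothing to download or wrap: Spark resolves the gs:// URI via its GCS connector.
    df = spark.read.text(url)
    print(df.count())


@workflow
def pipeline(url: str = "gs://xxxx"):
    spark_read_task(url=url)
```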
t
Thx for these points.
k
Definitely read the docs on both
One benefit of FlyteFile is native understanding by the UI, backend, flytectl, etc.