# ask-the-community
t
Hi, everyone. I have a naive question: how do I load a file from a GCS bucket inside a spark_task? Is there an idiomatic way to do it in Flyte? (Normally I would add some Maven coordinates to my Spark config, but I noticed that this does not work for Flyte in a cluster where the Spark images have to be pre-built.)
y
In Flyte, a task can use a file type for input/output, and Flyte uses S3/GCS for persistent storage. If the file is returned by another Flyte task, you can pass it to your Spark task as an input and you don't need to deal with anything. If the file is not produced by Flyte, then you need to pass the file path URL into the task, and the task will use a GCS client to download the file. https://docs.flyte.org/projects/cookbook/en/latest/auto/core/flyte_basics/files.html
cc: @Samhita Alla
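A minimal sketch of that second case (a file not produced by Flyte), assuming the google-cloud-storage package is installed in the task image and the task has read access to the bucket; fetch_gcs_file and the /tmp destination are illustrative names, not Flyte APIs:

```python
from google.cloud import storage
from flytekit import task
from flytekit.types.file import FlyteFile


@task
def fetch_gcs_file(url: str) -> FlyteFile:
    # Split "gs://bucket/path/to/object" into bucket and blob names.
    bucket_name, _, blob_name = url.removeprefix("gs://").partition("/")
    local_path = "/tmp/input_file"  # hypothetical local destination
    storage.Client().bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
    # Returning a FlyteFile lets Flyte re-upload it to its own blob store
    # and pass it to downstream tasks.
    return FlyteFile(local_path)
```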
t
so if I understand it correctly, I should be able to read a gcs file in flyte like this:
```python
import flytekit
from flytekit import task, workflow
from flytekit.types.file import FlyteFile
from flytekitplugins.spark import Spark


@task
def get_file(url: str) -> FlyteFile:
    return FlyteFile(url)


@task(task_config=Spark(...))
def spark_task(file: FlyteFile):
    spark = flytekit.current_context().spark_session
    df = spark.read.text(file)


@workflow
def pipeline(url: str = "gs://xxxx"):
    file = get_file(url=url)
    spark_task(file=file)
```
Is this the right way to do it?
y
yes
You don't need any extra logic for S3/GCS.
t
Do I need to add one more step, FlyteFile(url).download()?
y
no 😁
t
That sounds great! Thx a lot.
I tried with the above approach, but got this error in `spark_task`:
'FlyteFile' object has no attribute '_get_object_id'
y
Sorry, my bad. You need file.download(). @Samhita Alla can you confirm the logic?
s
Oh yes, file.download() is required.
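A sketch of that fix applied to the earlier example, assuming the same task signature; the Spark conf is a placeholder, and note that download() materializes the file on the driver, so the local path must also be readable by the executors (e.g. local mode):

```python
import flytekit
from flytekit import task
from flytekit.types.file import FlyteFile
from flytekitplugins.spark import Spark


@task(task_config=Spark(spark_conf={"spark.driver.memory": "1g"}))  # placeholder conf
def spark_task(file: FlyteFile):
    # download() pulls the blob to local disk on the driver and returns that path.
    local_path = file.download()
    spark = flytekit.current_context().spark_session
    df = spark.read.text(local_path)
    print(df.count())
```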
t
So the path returned from file.download() should be fed into spark.read.text(...), right?
k
You do not need file.download() for Spark; you want to hand over the remote path to the executors.
Each file input should have a .remote_path attribute.
@Tiansu Yu
You can of course download, but with Spark that would be odd.
cc @Yee / @Samhita Alla we should probably create a derivative type called RemoteFile that is never downloaded, only ever a remote path.
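A minimal sketch of handing over the remote path instead, assuming the pre-built Spark image ships the GCS Hadoop connector so the executors can read gs:// URIs directly (the conf is a placeholder):

```python
import flytekit
from flytekit import task
from flytekit.types.file import FlyteFile
from flytekitplugins.spark import Spark


@task(task_config=Spark(spark_conf={"spark.executor.instances": "2"}))  # placeholder conf
def spark_task(file: FlyteFile):
    spark = flytekit.current_context().spark_session
    # Pass the original gs:// URI straight to Spark; each executor reads it
    # itself, so nothing is downloaded on the driver.
    df = spark.read.text(file.remote_path)
    print(df.count())
```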
t
Yes, but let me be clear here: what do you mean by a remote_path, if not the original URL? And if you don't have a GCS connector on your executors, how does Spark read it anyway? So I don't get the point of wrapping my URL inside a FlyteFile and then returning it back again.
k
@Tiansu Yu I can probably help - but I am in PDT, so in 2 ish hours
FlyteFile's advantage is automatic upload; it's mostly meant for non-Spark use.
StructuredDataset, on the other hand, is handled natively for Spark.
But point noted, we should support FlyteFile on Spark too.
So FlyteFile is like a persistent file interface: when you work with it, it is materialized locally, and when you return it, it is automatically uploaded. You don't have to use it; you can use strings.
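A sketch of the plain-string alternative, where the gs:// URL is passed straight through the workflow and Spark (again assuming the GCS connector is in the image) resolves it itself; the task name and conf are illustrative:

```python
import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Spark


@task(task_config=Spark(spark_conf={"spark.executor.instances": "2"}))  # placeholder conf
def spark_read_task(url: str):
    spark = flytekit.current_context().spark_session
    # Nothing to download or wrap: Spark resolves the gs:// URI via its GCS connector.
    df = spark.read.text(url)
    print(df.count())


@workflow
def pipeline(url: str = "gs://xxxx"):
    spark_read_task(url=url)
```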
t
Thx for these points.
k
Definitely read the docs on both
One benefit of FlyteFile is native understanding by the UI, backend, flytectl, etc.