# ask-the-community
t
đź‘‹ hello all! It seems ETL has been a topic that has come up a few times and I'm bringing it up again. Some of the pipelines we'd like to move to Flyte are run-of-the-mill ETL. For example: grab all objects added to object storage in the last hour, turn them into a single
jsonl.gz
file, and write it to object storage in a different cloud account. I can accomplish this with custom tasks that connect to the source and target object storage systems, but that seems like a lot of boilerplate. It sure seems like Flyte is not designed for these types of ETL efforts and a different tool should be used for those. Am I reading the situation right?
k
Cc @Yee / @Kevin Su
Flyte has sensors now, and you can use events to trigger Flyte executions.
There is a blog on this
Also, you can of course run something on a schedule. Why do you say it is not designed for this?
t
Well, I should say that it seems like it wasn't designed for it. But I could certainly be missing the boat; I'm still quite new to Flyte. The main reason I'm thinking this way comes down to not seeing anything like "source", "destination", or "connection" definitions like you see in tools like Airflow/Airbyte. So, for the task I referenced above, say the source objects are in an Azure blob container and the target object should be written to an AWS S3 bucket. I think I would have to write code that: 1. lists keys in the Azure provider, 2. writes the `jsonl.gz` file to a local file (maybe the Flyte data directory), and 3. uploads the data to the S3 bucket. All this would work fine, of course, but that's the boilerplate I'm referring to. Since neither the source nor the destination would be the Flyte object store configured for the environment, I think we'd need to use some 3rd-party SDKs to interact with the object stores.
And, just for clarity's sake - are you referring to this blog?
k
Also, you wouldn't need third-party SDKs; `FlyteFile` will work directly. This is controlled by the raw data prefix config, which can be applied at the execution level.
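To make that concrete, here's a minimal sketch of the cross-cloud version of your example, using flytekit's fsspec-backed types. The bucket/container names and the merge logic are hypothetical, not from your setup; `remote_path` is what redirects the upload away from the default object store:

```python
import gzip
import os

from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile


@task
def consolidate(src: FlyteDirectory) -> FlyteFile:
    # flytekit downloads the directory via fsspec, whatever the scheme
    # of the input happens to be (abfs://, s3://, gs://, ...)
    local_dir = src.download()
    out_path = "objects.jsonl.gz"
    with gzip.open(out_path, "wt") as out:
        # Hypothetical merge: concatenate each downloaded object into one
        # jsonl stream (assumes a flat prefix of text files)
        for name in sorted(os.listdir(local_dir)):
            with open(os.path.join(local_dir, name)) as f:
                out.write(f.read().rstrip("\n") + "\n")
    # remote_path (hypothetical bucket) uploads the result somewhere other
    # than the object store Flyte is configured with
    return FlyteFile(out_path, remote_path="s3://target-bucket/objects.jsonl.gz")


@workflow
def etl(src: FlyteDirectory) -> FlyteFile:
    return consolidate(src=src)
```

You'd pass the Azure location at execution time, e.g. `pyflyte run etl.py etl --src abfs://my-container/some/prefix`, and (if I recall the flag right) `pyflyte run --raw-output-data-prefix ...` can set the default upload location for the whole execution instead of a per-file `remote_path`.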
t
Gotcha, so `FlyteFile` / `FlyteDirectory` are good fits for that type of activity - that's great.
k
Yup, let me share an example
@Kevin Su can you share a `pyflyte run` example that sets the output prefix?
Also @Terence Kent, sorry if you found me rude. I was asking more to understand how we can improve the docs to explain this further.
t
(didn't take it that way at all - but appreciate being careful)
k
Are you looking for a sensor? You could also check out the Airflow agent or the fsspec sensor. Here is an example:
from flytekit import task, workflow
from flytekit.sensor.file_sensor import FileSensor

sensor = FileSensor(name="test_sensor")

@task()
def t1():
    print("flyte")

@workflow
def wf():
    # Wait for this (placeholder) object to appear, then run t1
    sensor(path="s3://my-bucket/data/input.jsonl.gz") >> t1()
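(For context: `FileSensor` polls the given path until it exists; the scheme is resolved through fsspec, so remote protocols like `s3://` work too. Tasks chained after it with `>>` only run once the file shows up. The bucket path above is a placeholder.)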
t
@Kevin Su - Thanks! I think I got the hint that I needed (use `FlyteFile` and `FlyteDirectory` to interact with external object stores).
k
Yup, and `FlyteFile` / `FlyteDirectory` / `StructuredDataset` can use multiple protocols like `s3://`, `file://`, `sftp://`, etc., all handled by fsspec.
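As a quick illustration of that, a `StructuredDataset` input can point at any fsspec-reachable URI (passed at execution time) and be opened as a dataframe; the task name here is illustrative:

```python
import pandas as pd

from flytekit import task
from flytekit.types.structured import StructuredDataset


@task
def row_count(sd: StructuredDataset) -> int:
    # The uri's scheme (s3://, abfs://, file://, sftp://, ...) decides
    # which fsspec filesystem flytekit uses to fetch the data
    df = sd.open(pd.DataFrame).all()
    return len(df)
```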
y
@Terence Kent I think all the things you mentioned are doable in Flyte with equal or less boilerplate than any other system, actually. If you find yourself writing lots of boilerplate, post in the channel and we'll see if we can't reduce it. The only thing you might find unfamiliar is that Flyte does not force a time dimension like Airflow does, but you can add one and link it to a schedule using a kickoff time arg.
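A sketch of that last point, with made-up names: a cron launch plan can inject the scheduled time, and the workflow derives its own one-hour lookback window from it:

```python
from datetime import datetime, timedelta

from flytekit import CronSchedule, LaunchPlan, task, workflow


@task
def window(kickoff_time: datetime) -> str:
    # Derive the "objects added in the last hour" window from the
    # scheduler-provided timestamp
    start = kickoff_time - timedelta(hours=1)
    return f"process objects added between {start} and {kickoff_time}"


@workflow
def hourly_etl(kickoff_time: datetime) -> str:
    return window(kickoff_time=kickoff_time)


# The scheduler fills in kickoff_time on every scheduled run
hourly_etl_lp = LaunchPlan.get_or_create(
    workflow=hourly_etl,
    name="hourly_etl_lp",
    schedule=CronSchedule(schedule="0 * * * *", kickoff_time_input_arg="kickoff_time"),
)
```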