# ask-the-community
t
đź‘‹ hello all! It seems ETL has been a topic that has come up a few times and I'm bringing it up again. Some of the pipelines we'd like to move to Flyte are run-of-the-mill ETL. For example: grab all objects added to object storage in the last hour, turn them into a single
jsonl.gz
file, and write it to object storage in a different cloud account. I can accomplish this with custom tasks that connect to the source and target object storage systems, but that seems like a lot of boilerplate. It sure seems like Flyte is not designed for these types of ETL efforts and a different tool should be used for those. Am I reading the situation right?
k
Cc @Yee / @Kevin Su
Flyte has sensors now, and you can use events to trigger Flyte executions.
There is a blog on this
Also, you can of course run something on a schedule. Why do you say it is not designed for this?
t
Well, I should say that it seems like it wasn't designed for it. But I could certainly be missing the boat; I'm still quite new to Flyte. The main reason I'm thinking this way comes down to not seeing anything like "source", "destination", or "connection" definitions like you see in tools like Airflow/Airbyte. So, for the task I referenced above, say the source objects are in an Azure blob container and the target object should be written to an AWS S3 bucket. I think I would have to write code that: 1. lists keys in the Azure provider, 2. writes the `jsonl.gz` file to a local file (maybe the Flyte data directory), and 3. uploads the data to the S3 bucket. All this would work fine, of course, but that's the boilerplate I'm referring to. Since neither the source nor the destination would be the Flyte object store configured for the environment, I think we'd need to use some 3rd-party SDKs to interact with the object stores.
And, just for clarity's sake - are you referring to this blog?
k
Also, you wouldn't need third-party SDKs; `FlyteFile` will work directly. This is controlled by the raw data prefix config, which can be applied at the execution level.
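To make that concrete, here's a minimal sketch of the cross-cloud version of your example, using flytekit's fsspec-backed types. The bucket/container names and the merge logic are hypothetical, not from your setup; `remote_path` is what redirects the upload away from the default object store:

```python
import gzip
import os

from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile


@task
def consolidate(src: FlyteDirectory) -> FlyteFile:
    # flytekit downloads the directory via fsspec, whatever the scheme
    # of the input happens to be (abfs://, s3://, gs://, ...)
    local_dir = src.download()
    out_path = "objects.jsonl.gz"
    with gzip.open(out_path, "wt") as out:
        # Hypothetical merge: concatenate each downloaded object into one
        # jsonl stream (assumes a flat prefix of text files)
        for name in sorted(os.listdir(local_dir)):
            with open(os.path.join(local_dir, name)) as f:
                out.write(f.read().rstrip("\n") + "\n")
    # remote_path (hypothetical bucket) uploads the result somewhere other
    # than the object store Flyte is configured with
    return FlyteFile(out_path, remote_path="s3://target-bucket/objects.jsonl.gz")


@workflow
def etl(src: FlyteDirectory) -> FlyteFile:
    return consolidate(src=src)
```

You'd pass the Azure location at execution time, e.g. `pyflyte run etl.py etl --src abfs://my-container/some/prefix`, and (if I recall the flag right) `pyflyte run --raw-output-data-prefix ...` can set the default upload location for the whole execution instead of a per-file `remote_path`.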
t
Gotcha, so `FlyteFile` / `FlyteDirectory` are good fits for that type of activity - that's great.
k
Yup, let me share an example
@Kevin Su can you share a `pyflyte run` example that sets the output prefix?
Also @Terence Kent, sorry if you found me rude. I was asking more to understand how we can improve the docs to explain this further.
t
(didn't take it that way at all - but appreciate being careful)
k
Are you looking for a sensor? You could also check out the Airflow agent or the fsspec sensor. Here is an example:
from flytekit import task, workflow
from flytekit.sensor.file_sensor import FileSensor

sensor = FileSensor(name="test_sensor")

@task()
def t1():
    print("flyte")

@workflow
def wf():
    # Wait for this (placeholder) object to appear, then run t1
    sensor(path="s3://my-bucket/data/input.jsonl.gz") >> t1()
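(For context: `FileSensor` polls the given path until it exists; the scheme is resolved through fsspec, so remote protocols like `s3://` work too. Tasks chained after it with `>>` only run once the file shows up. The bucket path above is a placeholder.)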
t
@Kevin Su - Thanks! I think I got the hint that I needed (use `FlyteFile` and `FlyteDirectory` to interact with external object stores).
k
Yup, and `FlyteFile` / `FlyteDirectory` / `StructuredDataset` can use multiple protocols like `s3://`, `file://`, `sftp://`, etc., all handled by fsspec.
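As a quick illustration of that, a `StructuredDataset` input can point at any fsspec-reachable URI (passed at execution time) and be opened as a dataframe; the task name here is illustrative:

```python
import pandas as pd

from flytekit import task
from flytekit.types.structured import StructuredDataset


@task
def row_count(sd: StructuredDataset) -> int:
    # The uri's scheme (s3://, abfs://, file://, sftp://, ...) decides
    # which fsspec filesystem flytekit uses to fetch the data
    df = sd.open(pd.DataFrame).all()
    return len(df)
```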
y
@Terence Kent I think all the things you mentioned are doable in Flyte with equal or less boilerplate than any other system, actually. If you find yourself writing lots of boilerplate, post in the channel and we'll see if we can't reduce it. The only thing you might find unfamiliar is that Flyte does not force a time dimension like Airflow does, but you can add one and link it to a schedule using a kickoff time arg.
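A sketch of that last point, with made-up names: a cron launch plan can inject the scheduled time, and the workflow derives its own one-hour lookback window from it:

```python
from datetime import datetime, timedelta

from flytekit import CronSchedule, LaunchPlan, task, workflow


@task
def window(kickoff_time: datetime) -> str:
    # Derive the "objects added in the last hour" window from the
    # scheduler-provided timestamp
    start = kickoff_time - timedelta(hours=1)
    return f"process objects added between {start} and {kickoff_time}"


@workflow
def hourly_etl(kickoff_time: datetime) -> str:
    return window(kickoff_time=kickoff_time)


# The scheduler fills in kickoff_time on every scheduled run
hourly_etl_lp = LaunchPlan.get_or_create(
    workflow=hourly_etl,
    name="hourly_etl_lp",
    schedule=CronSchedule(schedule="0 * * * *", kickoff_time_input_arg="kickoff_time"),
)
```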