# ask-the-community

Quentin Chenevier

09/06/2023, 9:45 AM
Hello there 👋, I've been testing Flyte since yesterday and it's very cool! It's solving many of the issues I encountered as a data scientist / data engineer (on-demand infrastructure, versioning, etc.). However, there is something I still don't understand, and I'm a bit clueless about it even after having read the docs: what is the best (or at least "expected") pattern to access the data once you have computed it with Flyte? Flyte manages data in buckets using hashes, which is nice, but not very user-friendly. I've found you can use FlyteRemote to get the data you'd like. But:
• is there an easy way to "feed" an artifact repository (a separate bucket) which would ease access for non-Flyte users? Or is it an anti-pattern to do that?
• how can I remove old/unused artifacts from the Flyte bucket, e.g. all the artifacts which are not tagged with something like "release"?

Franco Bocci

09/06/2023, 10:56 AM
Flyte stores data internally during the workflow execution. For other things (like serving a model, exploring a scored dataset, etc.), I personally use another bucket or MLflow, not the Flyte one. If your tasks run using an IAM role (as an example, assuming you're using AWS, but the concept translates to GCP), you can grant that role permission for a separately created bucket, e.g. `my-own-bucket`, and then you can upload data from your Flyte workflow to that bucket.
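For example, a minimal sketch of that pattern (assuming AWS S3, a task role that can write to `my-own-bucket`, and `s3fs` installed so pandas can write straight to S3; bucket and key names are placeholders):
```python
# Minimal sketch: a task that uploads its result to a bucket outside Flyte's
# internal storage. Assumes the task's IAM role can write to my-own-bucket
# and that s3fs/pyarrow are available so pandas can write to S3 directly.
import pandas as pd
from flytekit import task


@task
def export_scores(scores: pd.DataFrame) -> str:
    # Destination bucket and key are placeholders.
    destination = "s3://my-own-bucket/exports/scores.parquet"
    scores.to_parquet(destination)
    return destination
```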

Ketan (kumare3)

09/06/2023, 1:44 PM
@Quentin Chenevier firstly, thank you for joining the community and sharing. You can change the output bucket for the data per workflow execution using the raw-output-prefix setting on an execution, launch plan, or registration (cc @Samhita Alla). You can also use FlyteRemote to access data from a previous execution. On a side note, we are working on a dataset / artifact service; you will see it soon. We have plans for how it can help experimentation, model serving, etc. Let me know if you want to hop on a call.
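For reference, a rough sketch of setting that per execution from Python (assuming flytekit exposes `Options` and `RawOutputDataConfig` as imported below; project, domain, workflow name, and bucket are placeholders):
```python
# Hedged sketch: redirect a single execution's raw/offloaded outputs to a
# bucket you control, via FlyteRemote.
from flytekit.configuration import Config
from flytekit.models.common import RawOutputDataConfig
from flytekit.remote import FlyteRemote
from flytekit.tools.translator import Options

remote = FlyteRemote(config=Config.for_sandbox())

# Placeholder project/domain/workflow; add version=... if you need a specific one.
wf = remote.fetch_workflow(
    project="flytesnacks", domain="development", name="workflows.example.wf"
)

# Ask Flyte to write this execution's offloaded data under your own prefix.
execution = remote.execute(
    wf,
    inputs={},
    options=Options(
        raw_output_data_config=RawOutputDataConfig(
            output_location_prefix="s3://my-own-bucket/flyte-raw-outputs"
        )
    ),
)
```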

Franco Bocci

09/06/2023, 1:45 PM
One doubt from my side then, sorry if my suggestion was misleading! Is it okay to do this? Or would you suggest having an independent bucket for Flyte and a separate one for artifacts used somewhere else (until there is a dataset and artifact service)?

Ketan (kumare3)

09/06/2023, 1:46 PM
Also @Franco Bocci, if you or @Stephen are interested as well.
@Franco Bocci I do not think your suggestion was misleading. This is how many folks work. We are just working to make this step unnecessary, in our effort to reduce boilerplate and simplify management.

Quentin Chenevier

09/06/2023, 8:12 PM
@Ketan (kumare3) Of course, I'd be interested in knowing more about the upcoming service you are working on. Sounds great 🙂 For the next few weeks, I'll try to get things going with a simple pattern to store artifacts and datasets (maybe MLflow, maybe something else).

Ketan (kumare3)

09/06/2023, 8:38 PM
@Quentin Chenevier in fact, the intent is for the intermediate data to be used like datasets / artifacts, albeit a little cumbersome to get to. The artifact service is like an indexing mechanism that makes this more powerful, but it will not copy over the data.
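For context, that "cumbersome" route today looks roughly like this with FlyteRemote (a hedged sketch; the execution name, project, and domain are placeholders):
```python
# Hedged sketch: pulling a previous execution's outputs back with FlyteRemote,
# assuming fetch_execution / sync_execution / outputs behave as shown.
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(config=Config.for_sandbox())

# Placeholder execution name; use the name shown in the Flyte console/URL.
execution = remote.fetch_execution(
    project="flytesnacks", domain="development", name="f8abc1234def"
)
execution = remote.sync_execution(execution, sync_nodes=True)

# outputs is a LiteralsResolver keyed by the workflow's output names (o0, o1, ...).
print(execution.outputs)
```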

Quentin Chenevier

09/06/2023, 8:44 PM
Yes, using the Flyte bucket as a warehouse and not copying the data would be awesome. Right now, I'm playing with FlyteRemote to do this indexing (in some crappy notebook code). A question may arise pretty quickly though: what kind of retention policy? Having a separate artifact storage makes it clear which data / results are important and shouldn't be deleted.

Ketan (kumare3)

09/06/2023, 9:05 PM
ya

Quentin Chenevier

09/06/2023, 9:16 PM
I'm dropping here the snippet I just wrote to list all the data/artifacts produced by the various workflows (note: this is for a sandbox):
```python
# %%
import pandas as pd
from flytekit.remote import FlyteRemote
from flytekit.configuration import Config

# %%
# Connect to the local sandbox cluster and grab the underlying admin client.
flyteremote = FlyteRemote(config=Config.for_sandbox())
client = flyteremote.client

# %%
# Walk every project / domain / execution / node execution and collect the
# outputs into a flat table.
data = []
projects = client.list_projects().projects
for project in projects:
    for domain in project.domains:
        executions, _ = client.list_executions_paginated(
            project=project.id, domain=domain.id
        )
        for execution in executions:
            node_executions, _ = client.list_node_executions(
                workflow_execution_identifier=execution.id
            )  # type: ignore
            for node_execution in node_executions:
                node_execution_data = client.get_node_execution_data(node_execution.id)
                for k, v in node_execution_data.full_outputs.literals.items():
                    data.append(
                        dict(
                            project=execution.id.project,
                            domain=execution.id.domain,
                            execution_name=execution.id.name,
                            node_id=node_execution.id.node_id,
                            param_type="output",
                            param_name=k,
                            # Assumes scalar outputs; collections/maps would
                            # need extra handling here.
                            param_value=v.scalar.value,
                        )
                    )

df = pd.DataFrame(data)
df
```
@Ketan (kumare3) you said:
> On a side note we are working on a dataset / artifact service. You will see it soon.
Do you know the expected release date for this service? Weeks, months, or years? I'm curious 😉

Ketan (kumare3)

09/07/2023, 11:05 PM
Sadly, open source is not yet planned, as open source needs a lot more work to make it scalable, easily deployable, etc.
We will probably launch it in Union Cloud first.

Quentin Chenevier

09/08/2023, 7:11 AM
Ah, I understand, thanks 🙂

Ketan (kumare3)

09/08/2023, 1:38 PM
Will definitely keep you posted, and as you will see with the community, our goal is to make the most performant system available.