# flyte-support
r
Hello there 👋, I've been testing Flyte since yesterday and it's very cool! It's solving many of the issues I encountered as a data scientist / data engineer (on-demand infrastructure, versioning, etc.). However, there is something I still don't understand and I'm a bit clueless about it even after having read the docs... What is the best (or at least "expected") pattern to access the data once you have computed it with Flyte? Flyte manages data in buckets using hashes, which is nice but not very user-friendly. I've found you can use FlyteRemote to get the data you'd like. But:
• is there an easy way to "feed" an artifact repository (a separate bucket) which would ease access for non-Flyte users? Or is it an anti-pattern to do that?
• how can I remove old/unused artifacts from the Flyte bucket? E.g. all the artifacts which are not tagged with something like "release"?
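(For illustration, a rough sketch of the FlyteRemote approach mentioned above for pulling the outputs of a past execution; the project, domain, and execution name are placeholders, and exact import paths can vary between flytekit versions:)
```python
# Rough sketch: fetch a finished execution and read its outputs via FlyteRemote.
# "flytesnacks" / "development" / "<execution-name>" are placeholders.
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(config=Config.for_sandbox())

execution = remote.fetch_execution(
    project="flytesnacks", domain="development", name="<execution-name>"
)
# Sync node and output information from the Flyte control plane.
execution = remote.sync_execution(execution, sync_nodes=True)
print(execution.outputs)  # resolved workflow outputs
```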
t
Flyte stores data internally during the workflow execution. For other things (like serving a model, exploring a scored dataset, etc.), I personally use another bucket or MLflow, not the Flyte one. If your tasks run using an IAM role (as an example, assuming you're using AWS, but the concept translates to GCP), you can grant that role permission for a separately created bucket, e.g. `my-own-bucket`, and then you can upload data from your Flyte workflow to that bucket
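(A hypothetical sketch of that pattern, assuming AWS, boto3, and a bucket named `my-own-bucket` that the task's IAM role can write to; the task and key names are made up:)
```python
# Hypothetical example: publish a task result to a separately managed bucket.
import boto3
from flytekit import task


@task
def publish_report(local_csv_path: str) -> str:
    # Relies on the pod's IAM role having s3:PutObject on my-own-bucket.
    s3 = boto3.client("s3")
    key = "reports/latest.csv"
    s3.upload_file(local_csv_path, "my-own-bucket", key)
    return f"s3://my-own-bucket/{key}"
```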
f
@rough-sugar-4818 firstly, thank you for joining the Community and sharing. You can change the output bucket for the data per workflow execution using the raw-output-prefix setting on an execution, launch plan, or registration (cc @tall-lock-23197). Also, you can use FlyteRemote to access data from a previous execution. On a side note, we are working on a dataset / artifact service. You will see it soon. We have plans for how it can help experimentation, model serving, etc. Let me know if you want to hop on a call
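(For reference, a rough sketch of overriding the raw output prefix for a single execution through FlyteRemote; the workflow name and bucket are placeholders, and the exact import paths may differ between flytekit versions:)
```python
# Rough sketch: send raw output data for one execution to your own bucket.
# The bucket/prefix and workflow name below are placeholders.
from flytekit.configuration import Config, Options
from flytekit.models.common import RawOutputDataConfig
from flytekit.remote import FlyteRemote

remote = FlyteRemote(config=Config.for_sandbox())
wf = remote.fetch_workflow(project="flytesnacks", domain="development", name="my_wf")

remote.execute(
    wf,
    inputs={},
    options=Options(
        raw_output_data_config=RawOutputDataConfig(
            output_location_prefix="s3://my-own-bucket/flyte-raw-outputs"
        )
    ),
)
```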
t
One doubt from my side then, sorry if my suggestion was misleading! Is it okay to do this? Or would you suggest having an independent bucket for Flyte and a separate one for artifacts used somewhere else (until there is a dataset and artifact service)?
f
Also @thankful-tailor-28399 if you or @jolly-whale-9142 are interested as well
➕ 1
@thankful-tailor-28399 I do not think your suggestion was misleading. This is how many folks work. We are just working to make this step unnecessary in our effort to reduce boilerplate and simplify management
πŸ‘ 1
r
Thank you @thankful-tailor-28399 and @freezing-airport-6809 for your kind answers! If I understand correctly, the main use of the Flyte storage is to support the cache, and it is not intended to be used as a "dataset / artifact service". For that, the user has to choose another tool (like MLflow, as you suggest @thankful-tailor-28399)
@freezing-airport-6809 Of course, I'd be interested in knowing more about the upcoming service you are working on. Sounds great. 🙂 For the next few weeks, I'll try to get things going with a simple pattern to store artifacts and datasets (maybe MLflow, maybe something else).
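(For illustration, a minimal sketch of that kind of MLflow-from-a-task pattern; the tracking URI, task name, and run name are made up:)
```python
# Hypothetical sketch: push a produced artifact to an MLflow tracking server
# from inside a Flyte task. The tracking URI is a placeholder.
import mlflow
from flytekit import task


@task
def log_artifact_to_mlflow(model_path: str) -> str:
    mlflow.set_tracking_uri("http://mlflow.internal.example.com")
    with mlflow.start_run(run_name="flyte-export") as run:
        mlflow.log_artifact(model_path)
        return run.info.run_id
```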
f
@rough-sugar-4818 in fact, the use of the intermediate data is to use it like datasets / artifacts, albeit a little cumbersome to get to. The artifact service is like an indexing mechanism that makes it more powerful, but it will not copy over the data
r
Yes, using the Flyte bucket as a warehouse and not copying the data would be awesome. Right now, I'm playing with FlyteRemote to do this indexing (in some crappy notebook code). A question may arise pretty quickly though: what kind of retention policy? Having a separate artifact storage makes it clear which data / results are important and shouldn't be deleted.
f
ya
r
I'm dropping here the snippet I just wrote to list all the data/artifacts produced by the various workflows (note: this is for a sandbox):
```python
# %%
import pandas as pd
from flytekit.remote import FlyteRemote
from flytekit.configuration import Config

# %%
flyteremote = FlyteRemote(config=Config.for_sandbox())
client = flyteremote.client

# %%
data = []
projects = client.list_projects().projects
for project in projects:
    for domain in project.domains:
        # Note: only the first page of executions is returned here; use the
        # pagination token to walk further pages if needed.
        executions, _ = client.list_executions_paginated(
            project=project.id, domain=domain.id
        )
        for execution in executions:
            node_executions, _ = client.list_node_executions(workflow_execution_identifier=execution.id)  # type: ignore
            for node_execution in node_executions:
                node_execution_data = client.get_node_execution_data(node_execution.id)
                # Note: v.scalar works for scalar outputs (primitives, blobs,
                # structured datasets); collection/map outputs need extra handling.
                for k, v in node_execution_data.full_outputs.literals.items():
                    data.append(
                        dict(
                            project=execution.id.project,
                            domain=execution.id.domain,
                            execution_name=execution.id.name,
                            node_id=node_execution.id.node_id,
                            param_type="output",
                            param_name=k,
                            param_value=v.scalar.value,
                        )
                    )

df = pd.DataFrame(data)
df
```
@freezing-airport-6809 you said:
> On a side note, we are working on a dataset / artifact service. You will see it soon.
Do you know the expected release date of this service? Weeks, months, or years? I'm curious 😉
f
sadly an open source release is not yet planned, as open source needs a lot more work to make it scalable, easily deployable, etc.
we will probably launch it in Union Cloud first
r
Ha, I understand, thanks 🙂
f
Will definitely keep you posted, and as you will see with the community, our goal is to make the most performant system available