# flyte-support
q
Hi everyone. What is the best way to allow a failed task to still produce some output for the user? Here is a real use case: let's say we have a task for training a model. If training fails, I prefer that the user sees this as a failed task, but I still want to return some information to them (e.g., the location of all the checkpoints created so far). I know there is intra-task checkpointing, but that is useful for passing info to the next run triggered by the recover button. In my case, I want to pass richer info to the user so that they can decide on their next step (whether to retry using a previous checkpoint or run from scratch, see all the checkpoints and decide which ones to use, etc.).
In my past job, we had an in-house orchestrator that had the concept of side artifacts. These artifacts could be streamed while the task was running (as opposed to a normal artifact, which would be created only upon successful completion of the task). AFAICT, Flyte doesn't have this concept (unless you have some sort of hidden contract with the user and place your side artifacts in a known directory). Is there a clean and type-safe way to produce some output even when the task fails?
a
@quick-helicopter-88984 does the Failure node feature match what you're looking for?
q
@average-finland-92144 Thanks for the response, but it doesn't exactly solve my problem. My team is an infra team and we create tasks that are used by many of our users. We control the task (e.g., the trainer task) and our users use it in their own workflows (which we don't control). So we want this logic to be encapsulated in the task itself, kind of like a regular function that can return information both on success (through its return value) and on failure (through an exception or error code). In our case, the info passed on failure is the location of the checkpoints.
Although reading the docs again, I think if we convert our task to a subworkflow, it can do this
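For reference, here is a minimal sketch of that subworkflow idea using flytekit's `on_failure` failure node. All task, workflow, and input names (`train_model`, `report_checkpoints`, `checkpoint_prefix`, the S3 path) are hypothetical, and the exact signature expected of the failure handler may vary by flytekit version:

```python
from flytekit import task, workflow


@task
def train_model(dataset: str, checkpoint_prefix: str) -> str:
    # ... training loop that writes checkpoints under `checkpoint_prefix`
    # and may raise if training diverges ...
    return "s3://bucket/models/final"  # hypothetical final model location


@task
def report_checkpoints(dataset: str, checkpoint_prefix: str):
    # Runs when the workflow fails: surface whatever the user should see,
    # e.g. log (or persist) the prefix under which checkpoints were written.
    print(f"Training failed; checkpoints written so far live under {checkpoint_prefix}")


# The failure handler is expected to mirror the workflow's inputs.
@workflow(on_failure=report_checkpoints)
def trainer_wf(dataset: str, checkpoint_prefix: str) -> str:
    return train_model(dataset=dataset, checkpoint_prefix=checkpoint_prefix)
```

Users would then call `trainer_wf` from their own workflows, so the failure handling stays encapsulated in what the infra team ships.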
a
Be aware that if the parent workflow fails, so will the subworkflow. I'll try to loop in someone with better insights to help you
q
Thank you so much
g
> I prefer that the user sees this as a failed task. But I still want to return some information to the user
How should we pass the information to your users in this case? Or, let's say we save this information somewhere in the backend: how do your users get it? Or do they just want to see the info in the UI?
f
Actually, the checkpoint locations are already stored in the DB; we currently do not show them in the UI (if you are using the intra-task checkpoint paths).
h
@quick-helicopter-88984, can you comment a bit more on how users could reuse checkpoints in your scenario? Let's say a task fails and we were able to bubble up the intra-task checkpoint address (as a property of the failed task), how are users supposed to use this in the executions of their workflows? Unfortunately we don't provide (yet?) a mechanism to inject checkpoints into task executions, but for the purposes of this exercise I just want to understand a bit more how this info could be used.
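For context, a minimal sketch of the intra-task checkpointing that exists today, assuming flytekit's `current_context().checkpoint` API; the training details and paths are hypothetical, and the exact `save`/`restore` behavior may differ between flytekit versions:

```python
import flytekit
from flytekit import task


@task(retries=3)
def train_model(dataset: str) -> str:
    cp = flytekit.current_context().checkpoint

    # On a retry (or after "recover"), this pulls back whatever the previous
    # attempt saved; on the first attempt there is nothing to restore.
    prev = cp.restore("/tmp/restored")

    for step in range(100):
        # ... train, resuming from `prev` if it is set, and write the current
        # state to /tmp/checkpoint.pt every few steps ...
        if step % 10 == 0:
            cp.save("/tmp/checkpoint.pt")

    return "s3://bucket/models/final"  # hypothetical final model location
```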
q
Hi, sorry for the delay.
@high-accountant-32689 @freezing-airport-6809 Here is one way I have seen it done in another orchestrator: they had a concept called "side artifacts", which allowed a task to stream some output defined in the contract of the task. A task could produce these side artifacts while running (before being fully done). There was a constraint that you could not pass these side artifacts to downstream tasks (for obvious reasons). In that setup, a task like a model training task would have some regular output artifacts (trained model, final eval results) and some side artifacts (a stream/list of checkpoints and a stream/list of model eval results to be passed to TensorBoard).
This allowed two things:
• Showing your model eval results (which were created periodically during training) on TensorBoard without relying on a hidden contract (e.g., a fixed location on disk, or the trainer task making an RPC call to some service, which is a kind of side effect)
• If training failed, the user could find the set of checkpoints and then rerun the pipeline, passing a chosen checkpoint as an input to the trainer (the trainer task has an optional input called "warm start" that takes a checkpoint)
In general, I think supporting the notion of a side artifact is worth considering both for dealing with failures and for allowing folks to replace side effects of a task with something that gets defined in the interface (contract) of the task
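To illustrate the "warm start" idea mentioned above, here is a rough sketch of a trainer task that exposes an optional checkpoint input so a user can rerun from a checkpoint they picked after a failure. All names and paths here are hypothetical:

```python
from typing import Optional

from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task
def train_model(dataset: str, warm_start: Optional[FlyteFile] = None) -> FlyteFile:
    if warm_start is not None:
        local_ckpt = warm_start.download()
        # ... load weights from local_ckpt instead of initializing from scratch ...
    # ... train and write the final model to /tmp/model.pt ...
    return FlyteFile("/tmp/model.pt")


@workflow
def retrain_wf(dataset: str, warm_start: Optional[FlyteFile] = None) -> FlyteFile:
    return train_model(dataset=dataset, warm_start=warm_start)
```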
f
you can stream data out of a task, but it will not be committed
we are working on realtime decks that will allow for streaming viz out of tasks
cc @silly-toddler-37820
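For reference, this is roughly what a (non-realtime) Flyte Deck looks like today: HTML published from inside the task and rendered in the UI for that execution. The metrics content is hypothetical, and depending on the flytekit version the decorator flag is `enable_deck=True` or `disable_deck=False`; decks are also generally only uploaded once the task finishes, which is what the realtime decks work would change:

```python
import flytekit
from flytekit import task


@task(enable_deck=True)  # older flytekit versions use disable_deck=False instead
def train_model(dataset: str) -> str:
    rows = []
    for step in range(100):
        # ... train; periodically evaluate and record (step, loss) pairs ...
        if step % 10 == 0:
            rows.append(f"<tr><td>{step}</td><td>…</td></tr>")

    # Publish the eval history as an HTML table on a named deck; the Flyte UI
    # renders it alongside the task execution.
    flytekit.Deck(
        "eval-metrics",
        "<table><tr><th>step</th><th>loss</th></tr>" + "".join(rows) + "</table>",
    )
    return "s3://bucket/models/final"  # hypothetical
```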
q
That's great. Do you have any docs on the current streaming capabilities?
f
you can always write all outputs as a stream
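One concrete reading of this with what exists today is to collect every checkpoint under a single directory and expose it as a `FlyteDirectory` output, so the location is named in the task's interface; as the following messages point out, though, that output reference is only recorded if the task succeeds. Names and paths here are hypothetical:

```python
import os

from flytekit import task
from flytekit.types.directory import FlyteDirectory


@task
def train_model(dataset: str) -> FlyteDirectory:
    ckpt_dir = "/tmp/checkpoints"
    os.makedirs(ckpt_dir, exist_ok=True)

    for step in range(100):
        # ... train and write /tmp/checkpoints/step_<step>.pt periodically ...
        pass

    # Flyte uploads the directory and records its location as the task output,
    # but only once the task completes successfully.
    return FlyteDirectory(ckpt_dir)
```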
q
But can you make the location of that stream part of your interface? I want to avoid an implicit contract with the user of the task I am creating. Basically, I want the task itself to tell the user of the task where the data will be streamed to
f
yes, the stream is part of the interface
q
Yes, but if the task fails, there is going to be no output
(output as FlyteFile)
f
when you return outputs from a task, for example a file, the only thing returned is the location
q
So this only works when the task succeeds
f
so you want to receive the outputs even on failure, right?
ya this is not supported today, but with streaming decks this will work
let me share
q
Yes. For cases like model checkpoints
That's awesome. Looking forward to the streaming decks feature 🙂
f
q
🙏
f
please upvote