# flyte-support
q
Hi everyone. What is the best way to allow a failed task to still produce some output for the user? Here is a real use case: let's say we have a task for training a model. If training fails, I prefer that the user sees this as a failed task, but I still want to return some information to them (e.g., the location of all the checkpoints created so far). I know there is intra-task checkpointing, but that is useful for passing info to the next run triggered by the recover button. In my case, I want to pass richer info to the user so that they can decide on their next step (whether to retry using a previous checkpoint or run from scratch, see all the checkpoints and decide which ones to use, etc.).
In my past job, we had an in-house orchestrator that had the concept of side artifacts. These artifacts could be streamed while the task was running (as opposed to a normal artifact, which would be created only upon successful completion of the task). AFAICT, Flyte doesn't have this concept (unless you have some sort of hidden contract with the user and place your side artifacts in a known directory). Is there a clean and type-safe way to produce some output even when the task fails?
a
@quick-helicopter-88984 does the Failure node feature match what you're looking for?
q
@average-finland-92144 Thanks for the response, but it doesn't exactly solve my problem. My team is an infra team and we create tasks that are used by many of our users. We control the task (e.g., the trainer task) and our users use it in their own workflows (which we don't control). So we want this logic to be encapsulated in the task itself, kind of like a regular function that can return information both on success (through its return value) and on failure (through an exception or error code). In our case, the info passed on failure is the location of the checkpoints.
Although reading the docs again, I think if we convert our task to a subworkflow, it can do this
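For reference, here is a minimal sketch of that subworkflow idea using flytekit's `on_failure` failure node. All task, workflow, and input names (`train_model`, `report_checkpoints`, `checkpoint_prefix`, the S3 path) are hypothetical, and the exact signature expected of the failure handler may vary by flytekit version:

```python
from flytekit import task, workflow


@task
def train_model(dataset: str, checkpoint_prefix: str) -> str:
    # ... training loop that writes checkpoints under `checkpoint_prefix`
    # and may raise if training diverges ...
    return "s3://bucket/models/final"  # hypothetical final model location


@task
def report_checkpoints(dataset: str, checkpoint_prefix: str):
    # Runs when the workflow fails: surface whatever the user should see,
    # e.g. log (or persist) the prefix under which checkpoints were written.
    print(f"Training failed; checkpoints written so far live under {checkpoint_prefix}")


# The failure handler is expected to mirror the workflow's inputs.
@workflow(on_failure=report_checkpoints)
def trainer_wf(dataset: str, checkpoint_prefix: str) -> str:
    return train_model(dataset=dataset, checkpoint_prefix=checkpoint_prefix)
```

Users would then call `trainer_wf` from their own workflows, so the failure handling stays encapsulated in what the infra team ships.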
a
Be aware that if the parent workflow fails, so will the subworkflow. I'll try to loop in someone with better insights to help you
q
Thank you so much
g
> I prefer that the user sees this as a failed task. But I still want to return some information to the user
How should we pass the information to your users in this case? Or, let's say we save this information somewhere in the backend: how do your users get it? Or do they just want to see the info in the UI?
f
Actually, the checkpoint locations are already stored in the DB; we currently do not show them in the UI (if you are using the intra-task checkpoint paths).
h
@quick-helicopter-88984, can you comment a bit more on how users could reuse checkpoints in your scenario? Let's say a task fails and we were able to bubble up the intra-task checkpoint address (as a property of the failed task), how are users supposed to use this in the executions of their workflows? Unfortunately we don't provide (yet?) a mechanism to inject checkpoints into task executions, but for the purposes of this exercise I just want to understand a bit more how this info could be used.
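For context, a minimal sketch of the intra-task checkpointing that exists today, assuming flytekit's `current_context().checkpoint` API; the training details and paths are hypothetical, and the exact `save`/`restore` behavior may differ between flytekit versions:

```python
import flytekit
from flytekit import task


@task(retries=3)
def train_model(dataset: str) -> str:
    cp = flytekit.current_context().checkpoint

    # On a retry (or after "recover"), this pulls back whatever the previous
    # attempt saved; on the first attempt there is nothing to restore.
    prev = cp.restore("/tmp/restored")

    for step in range(100):
        # ... train, resuming from `prev` if it is set, and write the current
        # state to /tmp/checkpoint.pt every few steps ...
        if step % 10 == 0:
            cp.save("/tmp/checkpoint.pt")

    return "s3://bucket/models/final"  # hypothetical final model location
```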
q
Hi, sorry for the delay.
@high-accountant-32689 @freezing-airport-6809 Here is one way I have seen it done in another orchestrator: they had a concept called "side artifacts", which allowed a task to stream some output defined in the contract of the task. A task could produce these side artifacts while running (before being fully done). There was a constraint that you could not pass these side artifacts to downstream tasks (for obvious reasons). In that setup, a task like a model training task would have some regular output artifacts (trained model, final eval results) and some side artifacts (a stream/list of checkpoints and a stream/list of model eval results to be passed to TensorBoard).
This allowed two things:
• Showing your model eval results (which were created periodically during training) on TensorBoard without relying on a hidden contract (e.g., a fixed location on disk, or the trainer task making an RPC call to some service, which is a kind of side effect)
• If training failed, the user could find the set of checkpoints and then rerun the pipeline, passing a chosen checkpoint as an input to the trainer (the trainer task has an optional input called "warm start" that takes a checkpoint)
In general, I think supporting the notion of a side artifact is worth considering both for dealing with failures and for allowing folks to replace side effects of a task with something that gets defined in the interface (contract) of the task
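To illustrate the "warm start" idea mentioned above, here is a rough sketch of a trainer task that exposes an optional checkpoint input so a user can rerun from a checkpoint they picked after a failure. All names and paths here are hypothetical:

```python
from typing import Optional

from flytekit import task, workflow
from flytekit.types.file import FlyteFile


@task
def train_model(dataset: str, warm_start: Optional[FlyteFile] = None) -> FlyteFile:
    if warm_start is not None:
        local_ckpt = warm_start.download()
        # ... load weights from local_ckpt instead of initializing from scratch ...
    # ... train and write the final model to /tmp/model.pt ...
    return FlyteFile("/tmp/model.pt")


@workflow
def retrain_wf(dataset: str, warm_start: Optional[FlyteFile] = None) -> FlyteFile:
    return train_model(dataset=dataset, warm_start=warm_start)
```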
f
you can stream data out of a task, but it will not be committed
we are working on realtime decks that will allow for streaming viz out of tasks
cc @silly-toddler-37820
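For reference, this is roughly what a (non-realtime) Flyte Deck looks like today: HTML published from inside the task and rendered in the UI for that execution. The metrics content is hypothetical, and depending on the flytekit version the decorator flag is `enable_deck=True` or `disable_deck=False`; decks are also generally only uploaded once the task finishes, which is what the realtime decks work would change:

```python
import flytekit
from flytekit import task


@task(enable_deck=True)  # older flytekit versions use disable_deck=False instead
def train_model(dataset: str) -> str:
    rows = []
    for step in range(100):
        # ... train; periodically evaluate and record (step, loss) pairs ...
        if step % 10 == 0:
            rows.append(f"<tr><td>{step}</td><td>…</td></tr>")

    # Publish the eval history as an HTML table on a named deck; the Flyte UI
    # renders it alongside the task execution.
    flytekit.Deck(
        "eval-metrics",
        "<table><tr><th>step</th><th>loss</th></tr>" + "".join(rows) + "</table>",
    )
    return "s3://bucket/models/final"  # hypothetical
```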
q
That's great. Do you have any docs on the current streaming capabilities?
f
you can always write all outputs as a stream
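One concrete reading of this with what exists today is to collect every checkpoint under a single directory and expose it as a `FlyteDirectory` output, so the location is named in the task's interface; as the following messages point out, though, that output reference is only recorded if the task succeeds. Names and paths here are hypothetical:

```python
import os

from flytekit import task
from flytekit.types.directory import FlyteDirectory


@task
def train_model(dataset: str) -> FlyteDirectory:
    ckpt_dir = "/tmp/checkpoints"
    os.makedirs(ckpt_dir, exist_ok=True)

    for step in range(100):
        # ... train and write /tmp/checkpoints/step_<step>.pt periodically ...
        pass

    # Flyte uploads the directory and records its location as the task output,
    # but only once the task completes successfully.
    return FlyteDirectory(ckpt_dir)
```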
q
But can you make the location of that stream part of your interface? I want to avoid an implicit contract with the user of the task I am creating. Basically, I want the task itself to tell the user of the task where the data will be streamed to
f
yes, the stream is part of the interface
q
Yes, but if the task fails, there is going to be no output
(output as FlyteFile)
f
when you return outputs from a task, for example a file, the only thing returned is the location
q
So this only works when the task succeeds
f
so you want to receive the outputs even on failure, right?
ya this is not supported today, but with streaming decks this will work
let me share
q
Yes. For cases like model checkpoints
That's awesome. Looking forward to the streaming decks feature 🙂
f
q
🙏
f
please upvote