:thread:: is it recommended or some way to modify ...
# flyte-support
b
🧵: is it recommended, or is there some way, to modify the workflow before the registration phase, or perhaps a mutating webhook to modify the workflow? The use case is that we want to modify the user-specified workflow by appending an additional task that users don’t need to care about
t
if you want to modify, before registration is the place to do it, since the compiler check comes after that.
what’s the use-case out of curiosity?
flytekit recently added an `on_failure` argument… are you looking for a `finally`?
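(for context, a rough sketch of what that `on_failure` argument looks like in flytekit; the task names here are made up and the exact handler-signature requirements may differ by flytekit version)

```python
from flytekit import task, workflow


@task
def train() -> None:
    ...


@task
def handle_failure() -> None:
    # hypothetical handler: only runs if the workflow fails, so it is not a `finally`
    ...


# `on_failure` attaches a failure handler to the workflow
@workflow(on_failure=handle_failure)
def training_wf() -> None:
    train()
```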
b
> before registration is the place to do it,
yeah, we can “force” users to specify that in the DSL. but what we are looking for is to inject that task/node without users explicitly writing it in the DSL.
> are you looking for a `finally`?
Sort of.
the use case is that we are managing the k8s resources from a custom task via an agent. users can use that custom task together with other types of training tasks in a workflow. once that custom task has succeeded, i.e. the k8s resources were created successfully and are running, there is nowhere in flyte that kills those k8s resources. so we want to kill the k8s resources after the training tasks are done
right now the other idea I was exploring/trying out is to create another sensor agent to do the orchestrated GC work 😅
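(a rough sketch of the kind of pre-registration injection meant here, assuming a user workflow with no inputs; `with_cleanup` and `cleanup_k8s_resources` are hypothetical helpers, not an existing flytekit API)

```python
from flytekit import task, workflow


@task
def cleanup_k8s_resources(delete: bool) -> None:
    # hypothetical: tear down the K8s resources created by the custom agent task
    ...


def with_cleanup(user_wf, delete: bool = True):
    """Wrap the user-defined workflow and append a cleanup node before registration,
    so users never have to write the cleanup step in their DSL."""

    @workflow
    def wrapped_wf() -> None:
        user_node = user_wf()                           # user workflow runs as a sub-workflow
        cleanup = cleanup_k8s_resources(delete=delete)  # injected task
        user_node >> cleanup                            # ordering: cleanup only after the user workflow
    return wrapped_wf
```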
> flytekit recently added an `on_failure` argument…
you mean this PR: https://github.com/flyteorg/flytekit/pull/840?
t
yeah that one, just as an fyi, not relevant here.
could you elaborate a bit? when do you want the cleanup to happen, asap? or after the whole workflow is done?
b
Thanks for asking the question, it is a good one. after the user-defined workflow is done. users can also decide if they want to delete the k8s resources.
t
usually when a custom task creates a custom k8s resource (like spark), it’s up to the operator to clean up, and i think propeller deletes the CRD (after some time)
b
ahh in our case, we did not choose the operator route. basically, we have a custom agent to create, get, and delete the k8s resources based on the custom task configuration. the k8s resources can be deleted nicely if the custom task is in a running state and gets aborted from the UI. But once the task has succeeded, the k8s resources are hanging there (forever)
t
this is an AgentTask? like the Flyte Agent?
b
yes
t
you’re saying delete gets called on Abort but not on success?
the person who wrote most of that is out atm, but i’ll check with him to see if there was ever talk about adding a cleanup phase. is it possible to add the cleanup just as the last step before `get` returns success?
b
basically like this, Task 1 and Task 2 are executed in parallel:
-- Task 1 (custom agent task that manages the k8s resources)
|
-- Task 2 (training job such as MPIJob, or TFJob, etc.)
We want to selectively delete the k8s resources from task 1 once task 2 has succeeded
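(in flytekit terms, roughly this shape; the task names are placeholders)

```python
from flytekit import task, workflow


@task
def provision_graph_store() -> None:
    # Task 1: custom agent task that creates/manages the K8s resources
    ...


@task
def run_training_job() -> None:
    # Task 2: training job (e.g. MPIJob or TFJob)
    ...


@workflow
def training_wf() -> None:
    # no data dependency between the two nodes, so propeller runs them in parallel
    provision_graph_store()
    run_training_job()
```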
t
can you elaborate on selectively?
b
selectively means: if users specify that they don’t want to delete the resources created in task 1, we will keep them (no need to delete the k8s resources). otherwise, we will clean up the k8s resources from task 1 once task 2 succeeds
t
so an agent task configurable by the user, to run `delete` as part of success. not sure if this is in-scope for the agent interface… and `delete` will need to be idempotent ofc.
also i wasn’t aware we had piped the k8s api through to agent tasks.
b
the idempotency is handled by the K8s API itself.
> also i wasn’t aware we had piped the k8s api through to agent tasks.
oh yeah, we (from LinkedIn) are currently doing it for graph neural network training infra 😉
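(as an aside, a minimal sketch of what “idempotency handled by the K8s API” can look like with the kubernetes Python client; the CRD group/version/plural below are placeholders)

```python
from kubernetes import client, config


def delete_task1_resource(namespace: str, name: str) -> None:
    """Delete the custom resource backing Task 1; swallowing 404 makes the call idempotent."""
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    api = client.CustomObjectsApi()
    try:
        api.delete_namespaced_custom_object(
            group="example.org",   # placeholder CRD group
            version="v1",          # placeholder CRD version
            plural="graphstores",  # placeholder CRD plural
            namespace=namespace,
            name=name,
        )
    except client.exceptions.ApiException as e:
        if e.status != 404:        # already deleted -> treat as success
            raise
```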
t
let me follow-up a bit internally
b
The primary reason we chose the flyte agent route is to leverage the flyte agent framework without changes across the stack (flyte proto, backend plugin, SDK), and because it is easy to test, iterate, and operate in production
Kevin recently has helped with some of my problems as well.
t
oh i agree. backend plugins are substantially more complicated.
b
yeah, though I am a k8s developer and was preferring a custom k8s operator, where the backend plugin only needs to create custom resources. but the changes were evaluated to be a bit hairier than a custom agent. but then we would need to sort out other problems 😅
for example, in addition to this (maybe I should start another thread): does flyte support sharing a task node? say, can multiple workflows refer to the same task node?
t
and cleanup can still be invoked from `get`… but that is very hacky. `get` should not be mutating
b
if it does not support it, which can be fine, then we are wondering if we can refer to the k8s resources that were created by a custom agent task in another workflow?
t
ummm no, nodes (of any kind) are unique to a workflow.
b
yea I guess so…
> and cleanup can still be invoked from `get`…
good thought haha. though `get` is only invoked before the task has succeeded. once the task is “Succeeded”, `get` is not called again
t
cannot/should not, right? that other workflow has its own lifecycle.
it can get cleaned up at any time.
b
or make it even hackier: I can mark Task 1 as always running. 😅
t
the only way to access it is if you make it truly a shared resource (like how google.com is a shared resource). or rebuild the lifecycle management stuff we’ve built internally for the union Actor support.
but not a small undertaking, i can promise
b
yeah exactly. the distributed multi-node training for graph neural networks complicates everything tbh
I am open to doing any hacks atm lol
but I think your idea to check in the `get` call might work. basically, I mark task 1 as never succeeding, so the `get` call will always be invoked. in the `get` call, I will check the Task 2 training job execution status; if it is completed, I can just delete the k8s resources.
my other option I mentioned is a sensor cleanup agent… but it may impact the user workflow completion because the sensor execution will be part of the workflow
t
oh, task 1 & 2 are sibling tasks that you want to share a resource between?
b
task 1 & 2 are not siblings, but they have a relationship internally. task 2 will health-check the k8s resources from task 1 and query the graph/data from the k8s resources created in task 1. we handle the dependency internally (without needing task 1 to be executed first)
t
i see
b
if one task in the workflow is in Running state, the workflow will not be marked as completed, right?
can we mark the workflow as succeeded / failed / completed if one of the tasks that we care about has completed, failed, or succeeded?
t
manually mark as succeeded? no… any running node will indicate to propeller that the wf is still running.
b
no, i mean is there an API or configuration to do so? there are tasks 1 and 2; we only care about the state of task 2 and hence want to make the workflow state reflect that.
t
no… what you’re describing sounds like a sidecar task. that is not part of the flyte crd api.
it’s maybe worth it to consider if the lifecycle management can be done via sidecar containers, init containers, within a pod template. but that is a very different lower level approach. almost might be worth it to make a gh issue describing the scenario at a high level. if there’s a lot of interest, maybe there is a pattern that is missing from the flyte/flytekit api.
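(sketching what the pod-template route could look like in flytekit; whether a sidecar can really own the teardown here is untested, and the container image is made up)

```python
from flytekit import task, PodTemplate
from kubernetes.client import V1Container, V1PodSpec

# The extra container runs next to the task's primary container and could watch/clean up
# the K8s resources; a lower-level alternative to doing it in the agent.
training_pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(name="primary"),  # flytekit injects the task image/command here
            V1Container(
                name="lifecycle-sidecar",
                image="example.org/k8s-cleanup:latest",  # hypothetical cleanup image
            ),
        ],
    ),
)


@task(pod_template=training_pod_template)
def run_training_job() -> None:
    ...
```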
b
> what you’re describing sounds like a sidecar task. that is not part of the flyte crd api.
a sidecar task is still a flyte task. it will be part of the flyte API/ecosystem, no?
> sidecar containers, init containers, within a pod template.
actually, this might not be a bad idea if we need to go down to a lower level : )
> a gh issue describing the scenario at a high level
Will do while I am trying out approaches…
btw, thanks for the brainstorming! I worked it out with a custom sensor (which knows how to talk to k8s and do the selective deletion) at the moment. Injecting the sensor node will come later for UX purposes. : )
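(for anyone finding this later, roughly the shape of that custom sensor, assuming flytekit’s `BaseSensor` interface; the two helper functions are hypothetical and only sketched)

```python
from flytekit.sensor.base_sensor import BaseSensor


def training_job_finished(namespace: str, job_name: str) -> bool:
    # hypothetical: query the MPIJob/TFJob status via the K8s API
    ...


def delete_task1_resources(namespace: str, job_name: str) -> None:
    # hypothetical: idempotent delete of the K8s resources created by Task 1
    ...


class TrainingCleanupSensor(BaseSensor):
    """Polls the training job (Task 2) and, once it is done, selectively deletes
    the K8s resources created by the custom agent task (Task 1)."""

    async def poke(self, namespace: str, job_name: str, delete: bool) -> bool:
        if not training_job_finished(namespace, job_name):
            return False              # not done yet -> keep polling
        if delete:                    # "selective": honor the user's choice
            delete_task1_resources(namespace, job_name)
        return True                   # training finished (and cleanup done if requested)
```

(in a workflow it would be used roughly like `TrainingCleanupSensor(name="cleanup_sensor")(namespace=..., job_name=..., delete=True)`, mirroring how flytekit’s built-in `FileSensor` is used)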