:thread:: is it recommended or some way to modify ...
# flyte-support
b
🧵: is it recommended, or is there some way, to modify the workflow before the registration phase, or perhaps a mutating webhook to modify the workflow? The use case is that we want to modify the user-specified workflow by appending an additional task that users don’t need to care about
t
if you want to modify, before registration is the place to do it, since the compiler check comes after that.
what’s the use-case out of curiosity?
flytekit recently added an `on_failure` argument… are you looking for a `finally`?
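(for context, a rough sketch of what that `on_failure` argument looks like in flytekit; the task names here are made up and the exact handler-signature requirements may differ by flytekit version)

```python
from flytekit import task, workflow


@task
def train() -> None:
    ...


@task
def handle_failure() -> None:
    # hypothetical handler: only runs if the workflow fails, so it is not a `finally`
    ...


# `on_failure` attaches a failure handler to the workflow
@workflow(on_failure=handle_failure)
def training_wf() -> None:
    train()
```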
b
> before registration is the place to do it,
yeah, we can “force” users to specify that in the DSL. but what we are looking for is to inject that task/node without users explicitly writing it in the DSL.
> are you looking for a `finally`?
Sort of.
the use case is that we are managing the k8s resources from a custom task via an agent. users can use that custom task together with other types of training tasks in a workflow. once that custom task has succeeded, i.e. the k8s resources were created successfully and are running, there is nowhere in flyte that kills those k8s resources. so we want to kill the k8s resources after the training tasks are done
right now the other idea I was exploring/trying out is to create another sensor agent to do the orchestrated GC work 😅
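(a rough sketch of the kind of pre-registration injection meant here, assuming a user workflow with no inputs; `with_cleanup` and `cleanup_k8s_resources` are hypothetical helpers, not an existing flytekit API)

```python
from flytekit import task, workflow


@task
def cleanup_k8s_resources(delete: bool) -> None:
    # hypothetical: tear down the K8s resources created by the custom agent task
    ...


def with_cleanup(user_wf, delete: bool = True):
    """Wrap the user-defined workflow and append a cleanup node before registration,
    so users never have to write the cleanup step in their DSL."""

    @workflow
    def wrapped_wf() -> None:
        user_node = user_wf()                           # user workflow runs as a sub-workflow
        cleanup = cleanup_k8s_resources(delete=delete)  # injected task
        user_node >> cleanup                            # ordering: cleanup only after the user workflow
    return wrapped_wf
```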
> flytekit recently added an `on_failure` argument…
you mean this PR: https://github.com/flyteorg/flytekit/pull/840?
t
yeah that one, just as an fyi, not relevant here.
could you elaborate a bit? when do you want the cleanup to happen, asap? or after the whole workflow is done?
b
Thanks for asking the question, it is a good one. after the user-defined workflow is done. users can also decide if they want to delete the k8s resources.
t
usually when a custom task creates a custom k8s resource (like spark), it’s up to the operator to clean up, and i think propeller deletes the CRD (after some time)
b
ahh in our case, we did not choose the operator route. basically, we have a custom agent to create, get, and delete the k8s resources based on the custom task configuration. the k8s resources can be deleted nicely if the custom task is in a running state and gets aborted from the UI. But once the task has succeeded, the k8s resources are hanging there (forever)
t
this is an AgentTask? like the Flyte Agent?
b
yes
t
you’re saying delete gets called on Abort but not on success?
the person who wrote most of that is out atm, but i’ll check with him to see if there was ever talk about adding a cleanup phase. is it possible to add the cleanup just as the last step before `get` returns success?
b
basically like this, Task 1 and Task 2 are executed in parallel:
-- Task 1 (custom agent task that manages the k8s resources)
|
-- Task 2 (training job such as MPIJob, or TFJob, etc.)
We want to selectively delete the k8s resources from task 1 once task 2 has succeeded
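(in flytekit terms, roughly this shape; the task names are placeholders)

```python
from flytekit import task, workflow


@task
def provision_graph_store() -> None:
    # Task 1: custom agent task that creates/manages the K8s resources
    ...


@task
def run_training_job() -> None:
    # Task 2: training job (e.g. MPIJob or TFJob)
    ...


@workflow
def training_wf() -> None:
    # no data dependency between the two nodes, so propeller runs them in parallel
    provision_graph_store()
    run_training_job()
```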
t
can you elaborate on selectively?
b
selectively means: if users specify that they don’t want to delete the resources created in task 1, we will keep them (no need to delete the k8s resources). otherwise, we will clean up the k8s resources from task 1 once task 2 succeeds
t
so an agent task configurable by the user, to run `delete` as part of success. not sure if this is in-scope for the agent interface… and `delete` will need to be idempotent ofc.
also i wasn’t aware we had piped the k8s api through to agent tasks.
b
the idempotency is handled by the K8s API itself.
> also i wasn’t aware we had piped the k8s api through to agent tasks.
oh yeah, we (from LinkedIn) are currently doing it for graph neural network training infra 😉
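(as an aside, a minimal sketch of what “idempotency handled by the K8s API” can look like with the kubernetes Python client; the CRD group/version/plural below are placeholders)

```python
from kubernetes import client, config


def delete_task1_resource(namespace: str, name: str) -> None:
    """Delete the custom resource backing Task 1; swallowing 404 makes the call idempotent."""
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    api = client.CustomObjectsApi()
    try:
        api.delete_namespaced_custom_object(
            group="example.org",   # placeholder CRD group
            version="v1",          # placeholder CRD version
            plural="graphstores",  # placeholder CRD plural
            namespace=namespace,
            name=name,
        )
    except client.exceptions.ApiException as e:
        if e.status != 404:        # already deleted -> treat as success
            raise
```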
t
let me follow-up a bit internally
b
The primary reason we chose the flyte agent route is to leverage the flyte agent framework without changes across the stack (flyte proto, backend plugin, SDK), and because it is easy to test, iterate, and operate in production
Kevin recently has helped with some of my problems as well.
t
oh i agree. backend plugins are substantially more complicated.
b
yeah, though I am a k8s developer and was preferring a custom k8s operator, where the backend plugin only needs to create custom resources. but the changes were evaluated to be a bit hairier than a custom agent. but then we would need to sort out other problems 😅
for example, in addition to this (maybe I should start another thread): does flyte support sharing a task node? say, can multiple workflows refer to the same task node?
t
and cleanup can still be invoked from `get`… but that is very hacky. `get` should not be mutating
b
if it does not support it, which can be fine, then we are wondering if we can refer to the k8s resources that were created by a custom agent task in another workflow?
t
ummm no, nodes (of any kind) are unique to a workflow.
b
yea I guess so…
> and cleanup can still be invoked from `get`…
good thought haha. though `get` is only invoked before the task has succeeded. once the task is “Succeeded”, `get` is not called again
t
cannot/should not, right? that other workflow has its own lifecycle.
it can get cleaned up at any time.
b
or make it even hackier: I can mark Task 1 as always running. 😅
t
the only way to access it is if you make it truly a shared resource (like how google.com is a shared resource). or rebuild the lifecycle management stuff we’ve built internally for the union Actor support.
but not a small undertaking, i can promise
b
yeah exactly. the distributed multi-node training for graph neural networks complicates everything tbh
I am open to doing any hacks atm lol
but I think your idea to check in the `get` call might work. basically, I mark task 1 as never succeeding, so the `get` call will always be invoked. in the `get` call, I will check the Task 2 training job execution status; if it is completed, I can just delete the k8s resources.
my other option I mentioned is a sensor cleanup agent… but it may impact the user workflow completion because the sensor execution will be part of the workflow
t
oh, task 1 & 2 are sibling tasks that you want to share a resource between?
b
task 1 & 2 are not siblings, but they have a relationship internally. task 2 will health-check the k8s resources from task 1 and query the graph/data from the k8s resources created in task 1. we handle the dependency internally (without needing task 1 to be executed first)
t
i see
b
if one task in the workflow is in Running state, the workflow will not be marked as completed, right?
can we mark the workflow as succeeded / failed / completed if one of the tasks that we care about has completed, failed, or succeeded?
t
manually mark as succeeded? no… any running node will indicate to propeller that the wf is still running.
b
no, i mean is there an API or configuration to do so? there are tasks 1 and 2; we only care about the state of task 2 and hence want to make the workflow state reflect that.
t
no… what you’re describing sounds like a sidecar task. that is not part of the flyte crd api.
it’s maybe worth it to consider if the lifecycle management can be done via sidecar containers, init containers, within a pod template. but that is a very different lower level approach. almost might be worth it to make a gh issue describing the scenario at a high level. if there’s a lot of interest, maybe there is a pattern that is missing from the flyte/flytekit api.
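(sketching what the pod-template route could look like in flytekit; whether a sidecar can really own the teardown here is untested, and the container image is made up)

```python
from flytekit import task, PodTemplate
from kubernetes.client import V1Container, V1PodSpec

# The extra container runs next to the task's primary container and could watch/clean up
# the K8s resources; a lower-level alternative to doing it in the agent.
training_pod_template = PodTemplate(
    primary_container_name="primary",
    pod_spec=V1PodSpec(
        containers=[
            V1Container(name="primary"),  # flytekit injects the task image/command here
            V1Container(
                name="lifecycle-sidecar",
                image="example.org/k8s-cleanup:latest",  # hypothetical cleanup image
            ),
        ],
    ),
)


@task(pod_template=training_pod_template)
def run_training_job() -> None:
    ...
```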
b
> what you’re describing sounds like a sidecar task. that is not part of the flyte crd api.
a sidecar task is still a flyte task. it will be part of the flyte API/ecosystem, no?
> sidecar containers, init containers, within a pod template.
actually, this might not be a bad idea if we need to go down to a lower level : )
> a gh issue describing the scenario at a high level
Will do while I am trying out approaches…
btw, thanks for the brainstorming! I worked it out with a custom sensor (which knows how to talk to k8s and do the selective deletion) at the moment. Injecting the sensor node will come later for UX purposes. : )
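(for anyone finding this later, roughly the shape of that custom sensor, assuming flytekit’s `BaseSensor` interface; the two helper functions are hypothetical and only sketched)

```python
from flytekit.sensor.base_sensor import BaseSensor


def training_job_finished(namespace: str, job_name: str) -> bool:
    # hypothetical: query the MPIJob/TFJob status via the K8s API
    ...


def delete_task1_resources(namespace: str, job_name: str) -> None:
    # hypothetical: idempotent delete of the K8s resources created by Task 1
    ...


class TrainingCleanupSensor(BaseSensor):
    """Polls the training job (Task 2) and, once it is done, selectively deletes
    the K8s resources created by the custom agent task (Task 1)."""

    async def poke(self, namespace: str, job_name: str, delete: bool) -> bool:
        if not training_job_finished(namespace, job_name):
            return False              # not done yet -> keep polling
        if delete:                    # "selective": honor the user's choice
            delete_task1_resources(namespace, job_name)
        return True                   # training finished (and cleanup done if requested)
```

(in a workflow it would be used roughly like `TrainingCleanupSensor(name="cleanup_sensor")(namespace=..., job_name=..., delete=True)`, mirroring how flytekit’s built-in `FileSensor` is used)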