How do I implement dynamic epochs (e.g. early stopping) with Flyte?
# ask-the-community
e
How do I implement dynamic epochs (e.g. early stopping) with Flyte? Option A: define `train` and `eval` tasks and run them as long as a condition is met in `@dynamic`:
```python
from flytekit import task, dynamic

@task
def train():
    ...

@task
def eval() -> float:
    ...

@dynamic
def job():
    running = True

    while running:
        train_result = train()

        score = eval()

        if score > 0.5:
            running = False
```
Option B: one "giant" workflow without `train` and `eval` tasks:
```python
from flytekit import workflow

def train():
    ...

def eval() -> float:
    ...

@workflow
def job():
    running = True

    while running:
        train_result = train()

        score = eval()

        if score > 0.5:
            running = False
```
Are there advantages of one vs. the other? Would it be problematic that in Option A the DAG might grow very big?
b
Hi @ewam! I believe you will run into issues with both of your examples. Please note that while workflows look like regular Python code, they are actually a domain-specific language under the hood. I find these docs helpful. So when you define a workflow, this defines a static graph of operations. Generally speaking, the inputs and outputs define the order of operations. All of this happens at registration time, which means the graph is already frozen when you start it (i.e. no dynamic behaviour is possible). When you compare this to the `dynamic` task from your example, I like to think of it as one node in the workflow graph which can produce another (fixed) graph at workflow runtime. So while this gives you some dynamism, it is still a workflow that gets produced once (which is why the `while True` will most likely not work, as it leads to an infinitely sized graph). You might get around this using a fixed-size loop and conditionals.
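A rough sketch of that "fixed-size loop plus conditionals" idea, assuming flytekit's `conditional` construct; all task names and bodies here are placeholders, not a tested implementation:
```python
from flytekit import conditional, dynamic, task, workflow

@task
def train(round_idx: int) -> float:
    # Placeholder: pretend to train for one round and return some metric.
    return 0.1 * round_idx

@task
def evaluate(metric: float) -> float:
    # Placeholder: pretend to evaluate and return a score.
    return metric

@task
def passthrough(score: float) -> float:
    # No-op branch used once the target score has been reached.
    return score

@workflow
def train_round(round_idx: int) -> float:
    return evaluate(metric=train(round_idx=round_idx))

@dynamic
def job(max_rounds: int = 5, target: float = 0.5) -> float:
    score = train_round(round_idx=0)
    for i in range(1, max_rounds):
        # Branch on the previous round's score: keep training only while below target.
        score = (
            conditional(f"round_{i}")
            .if_(score < target)
            .then(train_round(round_idx=i))
            .else_()
            .then(passthrough(score=score))
        )
    return score
```
The graph size is bounded by `max_rounds`; the "skipped" rounds just fall through the pass-through branch.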
e
Okay, thanks. What again is problematic about "Option B"? Note that `train` and `eval` are not defined as tasks!
b
The problem is that you can't run arbitrary Python code in a `workflow` (as it is just a description of a graph, a DAG). So you could wrap all of this in a `task` instead of a `workflow`, but this might not be what you want.
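A minimal sketch of the wrap-it-all-in-one-task option; the helper names and placeholder bodies are made up, but inside a task the loop and the `break` are plain Python:
```python
import random
from flytekit import task

def train_step() -> None:
    pass  # placeholder: one round of actual training

def evaluate_step() -> float:
    return random.random()  # placeholder: an actual eval score

@task
def train_until_good_enough(target: float = 0.5, max_rounds: int = 100) -> float:
    score = 0.0
    for _ in range(max_rounds):
        train_step()
        score = evaluate_step()
        if score > target:
            break  # early stopping works here because this is ordinary Python inside one task
    return score
```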
e
Ah, yes, I meant to write `@task` of course. Sorry about the confusion.
Are you aware of the new signaling features btw? I wonder how they fit into the DAG structure. For instance, in this video (https://youtu.be/njNKBke5sQ0?t=295) they talk about "periodic triggers" within `@dynamic` workflows. Unfortunately, no example is given.
b
As far as I understand signalling, those are nodes in the DAG which need to be unblocked (either by a human or some other signal).
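For context, a small sketch of what such a blocked node could look like, assuming flytekit's `wait_for_input` gate-node API; the task names and the signal name are illustrative:
```python
from datetime import timedelta
from flytekit import task, workflow, wait_for_input

@task
def draft_report(score: float) -> str:
    return f"model score: {score}"

@task
def publish(report: str, title: str) -> str:
    return f"{title}\n{report}"

@workflow
def reporting_wf(score: float = 0.8) -> str:
    report = draft_report(score=score)
    # Gate node: the execution pauses at this node until someone sends the
    # "report-title" signal (e.g. from the UI or via FlyteRemote), then resumes.
    title = wait_for_input("report-title", timeout=timedelta(hours=1), expected_type=str)
    return publish(report=report, title=title)
```
So the signal is still just a node in the static graph, not a periodic trigger loop.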
e
So you agree with me that "periodic triggers" is maybe a bit misleading?
I'll just go with the one-huge-task implementation then.
It's a bit sad, because splitting training and evaluation would give me the ability for parallel execution (i.e. save time for the customer).
n
This use case justifies our ideas around support for asynchronous tasks/"eager" mode @Ketan (kumare3). I think this, and event-based/reactive triggers, really should be supported.
For early stopping @ewam, curious what your thinking is re: doing training loops at the workflow level instead of at the task level. IMO training loops are tight enough that they should be implemented at the task level, in which case early stopping is easy to do.
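A sketch of what task-level early stopping could look like, e.g. patience-based stopping inside the epoch loop; the helper functions are placeholders:
```python
from flytekit import task

def run_one_epoch() -> None:
    pass  # placeholder: one pass over the training data

def evaluate_model(epoch: int) -> float:
    return min(1.0, 0.1 * epoch)  # placeholder: validation score

@task
def train_with_early_stopping(max_epochs: int = 100, patience: int = 5) -> float:
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        run_one_epoch()
        score = evaluate_model(epoch=epoch)
        if score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop early: arbitrary Python control flow is fine inside a task
    return best_score
```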
e
@Niels Bantilan re: async/eager: great to hear that you already have some discussions going. The thing about "real-world" ML workflows (as in "MLaaS") is that they can be highly dynamic, which doesn't fit well into the DAG's "static" world.
re: train at workflow vs. at task level: as the "training" of a model (from data to production model) does not only consist of `def train()`, I'm 100% convinced it should be implemented as a workflow. Training a model could need the following tasks (sketched below):
• data loading
• data processing
• model loading
• model pretraining (could even be a sub-workflow)
• architecture search / AutoML / hyperparameter search
• training on the training dataset
• evaluation on the eval dataset
• model conversion / export / storage
Just realized that your question might just be about `def train()` as a task vs. workflow: in that case it should be a task, I think.
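For illustration, a skeleton of such a pipeline as a Flyte workflow; the task names and signatures are made up and the bodies are dummies:
```python
from typing import Dict
from flytekit import task, workflow

@task
def load_data() -> str:
    return "s3://bucket/raw-data"  # placeholder

@task
def preprocess(raw: str) -> str:
    return raw + "/processed"  # placeholder

@task
def load_base_model() -> str:
    return "s3://bucket/base-model"  # placeholder

@task
def search_hyperparameters(data: str, base_model: str) -> Dict[str, float]:
    return {"lr": 1e-3}  # placeholder

@task
def train_model(data: str, base_model: str, params: Dict[str, float]) -> str:
    return "s3://bucket/trained-model"  # placeholder

@task
def evaluate_model(model: str, data: str) -> float:
    return 0.8  # placeholder

@task
def export_model(model: str) -> str:
    return model + "/exported"  # placeholder

@workflow
def training_pipeline() -> float:
    raw = load_data()
    data = preprocess(raw=raw)
    base = load_base_model()
    params = search_hyperparameters(data=data, base_model=base)
    model = train_model(data=data, base_model=base, params=params)
    score = evaluate_model(model=model, data=data)
    export_model(model=model)
    return score
```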
n
I agree that all those steps make sense to implement as a workflow, but maybe "early stopping" in this context is overloaded: do you mean it in the sense that you stop early during a tight training loop of a single model, or in a hyperparameter tuning context?
k
I also think the above workflow is pretty static. Hyperparameter tuning can be implemented with Ray, as above? Though on a side note: this is a very interesting conversation. Let's get an RFC for eager mode out and see what folks think. There is an overhead today, but we can optimize this later.
e
do you mean it in the sense that you stop early during a tight training loop of a single model, or in a hyperparameter tuning context?
I mean it in the sense of "customer is satisfied with dice-score >= 0.8 -> no need to train further and pay for resources".
I also think the above workflow is pretty static. Hyperparameter tuning can be implemented with Ray, as above?
Yes, it is. Forgive me if the question was specifically about a "dynamic" workflow. Then I can put up another example (e.g. what we do currently in prod when training networks for customers).
n
I mean it in the sense of "customer is satisfied with dice-score >= 0.8 -> no need to train further and pay for resources"
Cool, so just to clarify: the `dice-score` is something that the customer determines for some hyperparameter configuration?
k
@ewam btw, great to meet you and thank you for joining the community. Please drop a line in the introductions or here.