How do I implement dynamic epochs (e.g. early stopping) with Flyte?
# ask-the-community
e
How do I implement dynamic epochs (e.g. early stopping) with Flyte? Option A: define `train` and `eval` tasks and run them as long as a condition is met in `@dynamic`:
```python
from flytekit import task, dynamic

@task
def train():
    ...

@task
def eval() -> float:
    ...

@dynamic
def job():
    running = True

    while running:
        train_result = train()

        score = eval()

        if score > 0.5:
            running = False
```
Option B: one "giant" workflow without `train` and `eval` tasks:
```python
from flytekit import workflow

def train():
    ...

def eval() -> float:
    ...

@workflow
def job():
    running = True

    while running:
        train_result = train()

        score = eval()

        if score > 0.5:
            running = False
```
Are there advantages of one vs. the other? Would it be problematic that in Option A the DAG might grow very big?
b
Hi @ewam! I believe you will run into issues with both of your examples. Please note that while workflows look like regular Python code, they are actually a domain-specific language under the hood. I find these docs helpful. So when you define a workflow, this defines a static graph of operations. Generally speaking, the inputs and outputs define the order of operations. All of this happens at registration time, which means the graph is already frozen when you start it (i.e. no dynamic behaviour is possible). When you compare this to the `dynamic` task from your example, I like to think of it as one node in the workflow graph which can produce another (fixed) graph at workflow runtime. So while this gives you some dynamism, it is still a workflow that gets produced once (which is why the `while True` will most likely not work, as it leads to an infinitely sized graph). You might get around this using a fixed-size loop and conditionals.
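A rough sketch of that "fixed-size loop plus conditionals" idea, assuming flytekit's `conditional` construct; all task names and bodies here are placeholders, not a tested implementation:
```python
from flytekit import conditional, dynamic, task, workflow

@task
def train(round_idx: int) -> float:
    # Placeholder: pretend to train for one round and return some metric.
    return 0.1 * round_idx

@task
def evaluate(metric: float) -> float:
    # Placeholder: pretend to evaluate and return a score.
    return metric

@task
def passthrough(score: float) -> float:
    # No-op branch used once the target score has been reached.
    return score

@workflow
def train_round(round_idx: int) -> float:
    return evaluate(metric=train(round_idx=round_idx))

@dynamic
def job(max_rounds: int = 5, target: float = 0.5) -> float:
    score = train_round(round_idx=0)
    for i in range(1, max_rounds):
        # Branch on the previous round's score: keep training only while below target.
        score = (
            conditional(f"round_{i}")
            .if_(score < target)
            .then(train_round(round_idx=i))
            .else_()
            .then(passthrough(score=score))
        )
    return score
```
The graph size is bounded by `max_rounds`; the "skipped" rounds just fall through the pass-through branch.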
e
Okay, thanks. What again is problematic about "Option B"? Note that `train` and `eval` are not defined as tasks!
b
The problem is that you can't run arbitrary Python code in a `workflow` (as it is just a description of a graph, a DAG). So you could wrap all of this in a `task` instead of a `workflow`, but this might not be what you want.
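A minimal sketch of the wrap-it-all-in-one-task option; the helper names and placeholder bodies are made up, but inside a task the loop and the `break` are plain Python:
```python
import random
from flytekit import task

def train_step() -> None:
    pass  # placeholder: one round of actual training

def evaluate_step() -> float:
    return random.random()  # placeholder: an actual eval score

@task
def train_until_good_enough(target: float = 0.5, max_rounds: int = 100) -> float:
    score = 0.0
    for _ in range(max_rounds):
        train_step()
        score = evaluate_step()
        if score > target:
            break  # early stopping works here because this is ordinary Python inside one task
    return score
```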
e
Ah, yes, I meant to write `@task` of course. Sorry about the confusion.
Are you aware of the new signaling features btw? I wonder how they fit into the DAG structure. For instance, in this video (https://youtu.be/njNKBke5sQ0?t=295) they talk about "periodic triggers" within `@dynamic` workflows. Unfortunately, no example is given.
b
As far as I understand signalling, those are nodes in the DAG which need to be unblocked (either by a human or some other signal).
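For context, a small sketch of what such a blocked node could look like, assuming flytekit's `wait_for_input` gate-node API; the task names and the signal name are illustrative:
```python
from datetime import timedelta
from flytekit import task, workflow, wait_for_input

@task
def draft_report(score: float) -> str:
    return f"model score: {score}"

@task
def publish(report: str, title: str) -> str:
    return f"{title}\n{report}"

@workflow
def reporting_wf(score: float = 0.8) -> str:
    report = draft_report(score=score)
    # Gate node: the execution pauses at this node until someone sends the
    # "report-title" signal (e.g. from the UI or via FlyteRemote), then resumes.
    title = wait_for_input("report-title", timeout=timedelta(hours=1), expected_type=str)
    return publish(report=report, title=title)
```
So the signal is still just a node in the static graph, not a periodic trigger loop.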
e
So you agree with me that "periodic triggers" is maybe a bit misleading?
I'll just go with the one-huge-task implementation then.
It's a bit sad, because splitting training and evaluation would give me the ability for parallel execution (i.e. save time for the customer).
n
This use case justifies our ideas around support for asynchronous tasks/"eager" mode @Ketan (kumare3). I think this, and event-based/reactive triggers, really should be supported.
For early stopping @ewam, curious what your thinking is re: doing training loops at the workflow level instead of at the task level. IMO training loops are tight enough that they should be implemented at the task level, in which case early stopping is easy to do.
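A sketch of what task-level early stopping could look like, e.g. patience-based stopping inside the epoch loop; the helper functions are placeholders:
```python
from flytekit import task

def run_one_epoch() -> None:
    pass  # placeholder: one pass over the training data

def evaluate_model(epoch: int) -> float:
    return min(1.0, 0.1 * epoch)  # placeholder: validation score

@task
def train_with_early_stopping(max_epochs: int = 100, patience: int = 5) -> float:
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        run_one_epoch()
        score = evaluate_model(epoch=epoch)
        if score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # stop early: arbitrary Python control flow is fine inside a task
    return best_score
```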
e
@Niels Bantilan re: async/eager: great to hear that you already have some discussions going. The thing about "real-world" ML workflows (as in "MLaaS") is that they can be highly dynamic, which doesn't fit well into the DAG's "static" world.
re: train at workflow vs. at task level: as the "training" of a model (from data to production model) does not only consist of `def train()`, I'm 100% convinced it should be implemented as a workflow. Training a model could need the following tasks (sketched below):
• data loading
• data processing
• model loading
• model pretraining (could even be a sub-workflow)
• architecture search / AutoML / hyperparameter search
• training on the training dataset
• evaluation on the eval dataset
• model conversion / export / storage
Just realized that your question might just be about `def train()` as a task vs. workflow: in that case it should be a task, I think.
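For illustration, a skeleton of such a pipeline as a Flyte workflow; the task names and signatures are made up and the bodies are dummies:
```python
from typing import Dict
from flytekit import task, workflow

@task
def load_data() -> str:
    return "s3://bucket/raw-data"  # placeholder

@task
def preprocess(raw: str) -> str:
    return raw + "/processed"  # placeholder

@task
def load_base_model() -> str:
    return "s3://bucket/base-model"  # placeholder

@task
def search_hyperparameters(data: str, base_model: str) -> Dict[str, float]:
    return {"lr": 1e-3}  # placeholder

@task
def train_model(data: str, base_model: str, params: Dict[str, float]) -> str:
    return "s3://bucket/trained-model"  # placeholder

@task
def evaluate_model(model: str, data: str) -> float:
    return 0.8  # placeholder

@task
def export_model(model: str) -> str:
    return model + "/exported"  # placeholder

@workflow
def training_pipeline() -> float:
    raw = load_data()
    data = preprocess(raw=raw)
    base = load_base_model()
    params = search_hyperparameters(data=data, base_model=base)
    model = train_model(data=data, base_model=base, params=params)
    score = evaluate_model(model=model, data=data)
    export_model(model=model)
    return score
```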
n
I agree that all those steps make sense to implement as a workflow, but maybe "early stopping" in this context is overloaded: do you mean it in the sense that you stop early during a tight training loop of a single model, or in a hyperparameter tuning context?
k
I also think the above workflow is pretty static. Hyperparameter tuning can be implemented with Ray, as above? Though on a side note: this is a very interesting conversation. Let's get an RFC for eager mode out and see what folks think. There is an overhead today, but we can optimize this later.
e
do you mean it in the sense that you stop early during a tight training loop of a single model, or in a hyperparameter tuning context?
I mean it in the sense of "customer is satisfied with dice-score >= 0.8 -> no need to train further and pay for resources".
I also think the above workflow is pretty static. Hyperparameter tuning can be implemented with Ray, as above?
Yes, it is. Forgive me if the question was specifically about a "dynamic" workflow. Then I can put up another example (e.g. what we do currently in prod when training networks for customers).
n
I mean it in the sense of "customer is satisfied with dice-score >= 0.8 -> no need to train further and pay for resources"
Cool, so just to clarify: the `dice-score` is something that the customer determines for some hyperparameter configuration?
k
@ewam btw, great to meet you and thank you for joining the community. Please drop a line in the introductions or here.