# flyte-support
l
Hi team, I’m running into a “Checkpointing not available” error after recently upgrading flytekit. I’m on flytekit==1.13 and Flyte backend 1.13. I verified from the docs here that the usage is correct. Code snippet:
cp = flytekit.current_context().checkpoint
encoded_name = cp.read()
Error:
cp = flytekit.current_context().checkpoint
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/venv/lib/python3.11/site-packages/flytekit/core/context_manager.py", line 262, in checkpoint
        raise NotImplementedError("Checkpointing is not available, please check the version of the platform.")
f
that does not make sense, let us reproduce
l
This is an existing workflow that has now started failing 😞
f
Sorry about that
This has not changed
Cc @thankful-minister-83577 can you check please
If it’s an issue we will have to file a bug
t
what version were you on before?
checkpointing has not changed in a while.
f
@thankful-minister-83577 can we just run a test on our platform? That’s what I think we should do
It might be some config on her end
l
went from 1.11 to 1.13
let me run it again and see if it happens again
This is quite bizarre… The task has retries.. so first try fails with the error
File "/opt/venv/lib/python3.11/site-packages/flytekit/exceptions/scopes.py", line 219, in user_entry_point
        return wrapped(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/root/flyte/workflows/spark/mapmatcher.py", line 200, in run_mapmatch
        run_spark_app(app)
      File "/root/flyte/workflows/spark/utils.py", line 68, in run_spark_app
        cp = flytekit.current_context().checkpoint
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/venv/lib/python3.11/site-packages/flytekit/core/context_manager.py", line 259, in checkpoint
        raise NotImplementedError("Checkpointing is not available, please check the version of the platform.")

Message:

    NotImplementedError: Checkpointing is not available, please check the version of the platform.
Then the next attempt is successful..
With 1.11 the map task actually retried.. with 1.13 the map task (array node) did not retry and just failed completely
I have cp in another workflow as well which works just fine even with 1.13… so not sure what’s causing the flakiness… and not retrying the failed map task is definitely concerning
@thankful-minister-83577 @freezing-airport-6809 ^
This is with 1.11
This is with 1.13
This is with 1.11
How it’s invoked:
map_task(
    run_function, concurrency=DEFAULT_CONCURRENCY
)(partitioned_input=partitioned_inputs).with_overrides(
    limits=Resources(mem="1Gi"),
    retries=NUM_RETRIES,
)
Looks like for 1.13 / ArrayNode I may need to just put it in the task metadata?
t
hey @little-cricket-84530 yeah, to control retries for the map task itself, you will need to set the task metadata field
map_task(t1, metadata=TaskMetadata(retries=N))(a=1)
l
Tried that.. still no retries
new code:
map_task(
    run_function,
    concurrency=DEFAULT_CONCURRENCY,
    metadata=TaskMetadata(
        cache=True, cache_version="0.0.1", retries=NUM_RETRIES, interruptible=False
    ),
)(partitioned_input=partitioned_inputs).with_overrides(
    limits=Resources(mem="1Gi"),
)
and same error about missing CP implementation
t
let me make a repro
mind sharing your cp code btw?
l
cp = flytekit.current_context().checkpoint
encoded_name = cp.read()
Fails on the first line
this is from the map task
Is there a problem with using CP from Array node ?
AND it isn’t retrying the failed array node 😐
even with the task metadata
Verified once again by downgrading flytekit that this is indeed the pattern:
1.11: Map task fails the first time due to “Checkpointing not available”, then retries and succeeds.
1.13: Array node fails the first time due to “Checkpointing not available”, and then DOES NOT retry despite providing TaskMetadata.
@thankful-minister-83577 ^
t
but that shouldn’t affect retries
l
any way to debug why retries aren’t working?
t
still looking.
can you tell me what backend version you’re running please?
l
1.13
t
cc @high-accountant-32689 who was also looking through this. I think it should be fine. The only thing is that you’ll need to raise a FlyteRecoverableException
This is the test code we’ve been running
[attached snippet: Untitled]
this will work as expected, the first two retries fail in the 3rd map task instance, and then succeeds. but the checkpoint code only works with that patch.
which we will have out tomorrow. still adding tests to it.
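The attached test code is not reproduced in this log. As a rough local illustration of the behaviour described above (two failed attempts in the third map task instance, then success), here is a plain-Python simulation. It is not the team's test and not Flyte's retry implementation: `run_with_retries` and `flaky_instance` are hypothetical helpers, and the fallback class only exists so the sketch runs where flytekit is not installed.

```python
try:
    from flytekit.exceptions.user import FlyteRecoverableException
except ImportError:
    class FlyteRecoverableException(Exception):
        """Stand-in so the sketch runs without flytekit installed."""


attempts = {}  # instance index -> number of attempts made so far


def flaky_instance(index: int) -> int:
    """Third instance (index 2) fails on its first two attempts, then succeeds."""
    attempts[index] = attempts.get(index, 0) + 1
    if index == 2 and attempts[index] <= 2:
        raise FlyteRecoverableException(
            f"instance {index} failed on attempt {attempts[index]}"
        )
    return index * 10


def run_with_retries(fn, arg, retries: int):
    """Hypothetical retry loop: only recoverable errors are retried, up to `retries` extra attempts."""
    for attempt in range(retries + 1):
        try:
            return fn(arg)
        except FlyteRecoverableException:
            if attempt == retries:
                raise  # retry budget exhausted


results = [run_with_retries(flaky_instance, i, retries=2) for i in range(3)]
print(results, attempts)
```

Any other exception type propagates immediately in this sketch, which mirrors the advice in the thread that the task needs to raise a FlyteRecoverableException for the retry to happen.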
l
Thanks.. I’ll give this a try
t
could you bump to 1.13.13 also please @little-cricket-84530?
l
thanks!
I’ll give it a shot
t
i was able to get the code a few messages up to run. (there’s an improvement i think we can make to the synccheckpoint class - prev_exists doesn’t seem to actually check if the folder exists, but that’s a separate issue we can address later)
l
so do I need to just update.. or add the code you mentioned in the wf?
t
update please
1.13.13 fixes the cp (in the map task case)
l
I’m updating the version
that’s all that’s needed right?
t
as long as you’re not relying on prev_exists to always be accurate yes