occasionally my workflow fails when a task gets...
# flyte-support
a
occasionally my workflow fails when a task gets this error:
/1] currentAttempt done. Last Error: USER::Grace period [3m0s] exceeded|containers with unready status: [primary]|Back-off pulling image "mycompany.com/buck/horse:19.2.0": ErrImagePull: failed to pull and unpack image "mycompany.com/buck/horse:19.2.0": failed to copy: read tcp 100.64.44.213:38748->10.37.121.189:443: read: connection reset by peer
this seems like a random node-related issue. Is the best solution here to set retries on all my tasks? (And is there a way to do this at the Flyte level rather than changing every invocation at the flytekit level?)
a
Hi @average-secretary-61436. The `retries` you set in the task config are for code-level issues. For this one, there's a setting you can configure to increase the grace period from the default 3m.
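The reply above does not name the setting. As a rough sketch only: the grace period in the error message most likely corresponds to the FlytePropeller Kubernetes plugin option `image-pull-backoff-grace-period` under `plugins.k8s`. Both the key name and the `10m` value are assumptions here, not something confirmed in this thread, so check them against the plugin configuration reference for your Flyte version. Where the fragment goes also depends on the deployment (for example, the propeller configmap, or the inline configuration section of a Helm chart).

```yaml
# Sketch of a FlytePropeller config fragment (assumed key name and location).
plugins:
  k8s:
    # How long to tolerate ImagePullBackOff / ErrImagePull on a pod before
    # failing the task attempt. The error above shows the 3m default.
    # 10m is an arbitrary illustrative value.
    image-pull-backoff-grace-period: 10m
```

A longer grace period only gives transient registry or network failures more time to clear. It does not fix the connection resets themselves.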
a
cool I'll try upping that and see what happens! appreciate it!
f
it's weird that you are getting ErrImagePull; the connection to the registry is failing
a
I haven't been able to figure out whether the problem is with our registry or with the node, but hopefully a longer timeout fixes it.