# flyte-support
g
Hello, Flyte Team! I have a question about the Flyte retry policy. I have seen Flyte re-run a failed task in the same workflow, but the re-runs happen after a long interval; for example, the failed task runs again a day later. Some succeeded tasks behave the same way, running again after a day or after more than 10 hours. Does anyone know why Flyte re-runs tasks after such a long interval?
This is one succeeded task that ran several times after its first run finished.
I also have one more question: are Flyte launch plans not archived automatically when I archive a Flyte project? I have seen that launch plans in archived projects are still running, and they cause errors in flyte-scheduler. Could someone look into this issue?
t
I don't understand. Why is a successful task being retried? Also, the retry interval shouldn't be that large. cc @thankful-minister-83577
> I have checked launchplans in the archived projects are still running and they make error in flyte-scheduler.
What kind of error are you seeing?
g
My team suspects a scheduler error. flyte-scheduler had restarted many times before. In the flyte-scheduler pod logs we saw that the scheduler was finding launch plans in archived Flyte projects (e.g. flytesnacks, flytetesters, flyteexamples). I thought launch plans would be archived automatically when their Flyte project was archived, but that is not implemented in the source code ;( (I checked, and it is still a "TODO"). After we archived all active launch plans in the archived projects, flyte-scheduler stopped restarting. We expect succeeded tasks not to run again, but we still have no idea why they do.
t
@thankful-minister-83577 @high-accountant-32689, two issues here:
1. Archiving a project does not archive the activated launch plans in that project.
2. A successful task is being retried.
b
Maybe we found the cause of the multiple executions:
1. A project is archived, but its scheduled launch plan is still active.
2. flyte-scheduler restarts for some reason.
3. flyte-scheduler reads its snapshot and starts catching up on all scheduled launch plans.
4. Some of those launch plans belong to an archived project, so "project is not active" errors keep occurring.
5. flyte-scheduler shuts down because of the continuous errors ("failed to catch up on all the schedules. Aborting").
6. The snapshot is therefore not created.
7. Steps 2 to 6 repeat.
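To make the loop in the steps above concrete, here is a toy model of it. This is a minimal sketch under stated assumptions, not flytescheduler's actual code: `Schedule`, `SchedulerModel`, `start`, and the schedule names `daily_load`/`prod` are all hypothetical. It assumes only what the thread describes: catch-up replays every missed fire time, an archived project makes catch-up abort, and the snapshot is written only after a fully successful catch-up. Under those assumptions, schedules in healthy projects get launched again on every restart, which would match the duplicated runs reported here.

```python
from dataclasses import dataclass, field

# Hypothetical toy model of the catch-up loop described in the thread.
# None of these names come from the flytescheduler source code.


@dataclass
class Schedule:
    name: str
    project: str
    fire_times: list[int]  # scheduled_at values (epoch seconds) to catch up on


@dataclass
class SchedulerModel:
    archived_projects: set[str]
    # Last fire time persisted per schedule (the "snapshot").
    snapshot: dict[str, int] = field(default_factory=dict)
    # Every execution this model has created, across all restarts.
    launched: list[tuple[str, int]] = field(default_factory=list)

    def start(self, schedules: list[Schedule]) -> bool:
        """Simulate one pod start: catch up on every missed fire time, then
        persist the snapshot. Returns False if catch-up aborts."""
        progress = dict(self.snapshot)  # in memory only until the snapshot is written
        for s in schedules:
            for t in s.fire_times:
                if t <= progress.get(s.name, -1):
                    continue  # already covered by the last persisted snapshot
                if s.project in self.archived_projects:
                    # "project is not active": abort, snapshot is NOT written
                    return False
                self.launched.append((s.name, t))
                progress[s.name] = t
        self.snapshot = progress  # reached only if every schedule caught up
        return True


# Usage: one schedule in a healthy project, one in an archived project.
model = SchedulerModel(archived_projects={"test"})
schedules = [
    Schedule("daily_load", "prod", [100, 200]),
    Schedule("sunday_launch_plan", "test", [150]),
]
for _ in range(3):  # three pod restarts
    model.start(schedules)
print(len(model.launched))  # daily_load's two runs are launched again on each restart
```

Dropping the archived-project schedule from the list (the equivalent of archiving its launch plan) lets one `start` finish and write the snapshot, after which restarts no longer re-launch anything.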
a
Hi @gifted-house-14547/@brief-oil-51532 Wondering if this is still a problem you're facing?
g
This error no longer appears now that we have fixed the flyte-scheduler error.
a
Cool, thanks for confirming. Any hint on how you ended up fixing it?
g
flyte-scheduler tried to run launch plans in archived projects, but the scheduler could not find those launch plans. As a result, it restarted continuously, which caused tasks to be re-run whether they had succeeded or not.
It seems like a very critical bug: my team loads tick data (basically stock data), and the scheduler causes duplicate loads of that data.
a
Do you happen to have logs for the `flytescheduler` pod?
g
Yeah, sure. I wrote some troubleshooting docs. Hang on a second.
Here are sample logs and the scheduler pod status.
```json
{
  "json": {
    "src": "executor_impl.go:121"
  },
  "level": "error",
  "msg": "failed to create execution create request %+v due to %v after all retriesproject:\"test\" domain:\"development\" name:\"fb441d3aa00fd1773000\" spec:<launch_plan:<resource_type:LAUNCH_PLAN project:\"test\" domain:\"development\" name:\"sunday_launch_plan\" version:\"174a603\" > metadata:<mode:SCHEDULED scheduled_at:<seconds:1665273600 > > > inputs:<>  rpc error: code = InvalidArgument desc = project [test] is not active",
  "ts": "2023-01-16T07:51:15Z"
}
```
I guess the scheduler pod restarts every 30 minutes because of the error above
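The `scheduled_at` field in that log line is worth decoding. The small Python sketch below parses the log record posted above (re-joined onto one JSON line) and pulls out the inactive project, the launch plan name, the scheduled time, and how far that time lies behind the log timestamp. The regular expressions are assumptions fitted to this one message format, not a documented log schema. For this record, `1665273600` decodes to 2022-10-09 00:00 UTC, a Sunday (consistent with `sunday_launch_plan`), roughly 99 days before the log was written. That is consistent with the catch-up never advancing past that fire time because no snapshot was ever saved.

```python
import json
import re
from datetime import datetime, timezone

# The scheduler log record from above, re-joined onto a single JSON line.
LOG_LINE = (
    r'{"json": {"src": "executor_impl.go:121"}, "level": "error", '
    r'"msg": "failed to create execution create request %+v due to %v after all retries'
    r'project:\"test\" domain:\"development\" name:\"fb441d3aa00fd1773000\" '
    r'spec:<launch_plan:<resource_type:LAUNCH_PLAN project:\"test\" domain:\"development\" '
    r'name:\"sunday_launch_plan\" version:\"174a603\" > '
    r'metadata:<mode:SCHEDULED scheduled_at:<seconds:1665273600 > > > inputs:<>  '
    r'rpc error: code = InvalidArgument desc = project [test] is not active", '
    r'"ts": "2023-01-16T07:51:15Z"}'
)


def summarize(line: str) -> dict:
    """Extract the inactive project, launch plan, scheduled time and lag.
    The patterns below are fitted to this single message format (assumption)."""
    record = json.loads(line)
    msg = record["msg"]
    seconds = int(re.search(r"scheduled_at:<seconds:(\d+)", msg).group(1))
    scheduled = datetime.fromtimestamp(seconds, tz=timezone.utc)
    logged = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return {
        "project": re.search(r"project \[([^\]]+)\] is not active", msg).group(1),
        # First name:"..." inside the launch_plan:<...> block
        "launch_plan": re.search(r'launch_plan:<[^>]*name:"([^"]+)"', msg).group(1),
        "scheduled_at": scheduled,
        "lag": logged - scheduled,  # how far the catch-up is behind the log time
    }


info = summarize(LOG_LINE)
print(info["project"], info["launch_plan"], info["scheduled_at"], info["lag"])
```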
a
So this is still an execution in a project that you intended to archive, but the launch plan remains active, correct? Sorry if you've been asked similar things several times; I'm just trying to isolate the issue.
g
No worries. I am worried that someone else will run into the same issue, so we should document it for this community. I archived all the launch plans in the archived projects manually with flytectl commands.
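When several projects are involved, the manual archiving step can be scripted. The Python helper below prints one `flytectl update launchplan ... --archive` command for each still-active launch plan that belongs to an archived project. It is a sketch with assumptions. The `LaunchPlan` tuple is hypothetical (not a flytekit type), and its records are assumed to be entered by hand, for example from the output of `flytectl get launchplan`. The `--version`/`--archive` flags follow the flytectl `update launchplan` docs of that period. Newer flytectl releases may spell the deactivation flag differently, so check `flytectl update launchplan --help` before running the printed commands.

```python
import shlex
from typing import Iterable, NamedTuple


class LaunchPlan(NamedTuple):
    # Hypothetical record type; fill it in from your own launch plan listing.
    project: str
    domain: str
    name: str
    version: str
    active: bool


def archive_commands(plans: Iterable[LaunchPlan], archived_projects: set) -> list:
    """Return one flytectl command per active launch plan in an archived project.
    Flag names are assumed from the flytectl docs; verify with --help."""
    cmds = []
    for lp in plans:
        if lp.active and lp.project in archived_projects:
            cmds.append(shlex.join([
                "flytectl", "update", "launchplan",
                "-p", lp.project, "-d", lp.domain, lp.name,
                "--version", lp.version, "--archive",
            ]))
    return cmds


# Usage with the launch plan from the log above plus two hypothetical ones.
plans = [
    LaunchPlan("test", "development", "sunday_launch_plan", "174a603", True),
    LaunchPlan("test", "development", "old_plan", "v1", False),      # already inactive
    LaunchPlan("prod", "production", "daily_load", "v2", True),      # project not archived
]
for cmd in archive_commands(plans, {"test", "flytesnacks"}):
    print(cmd)
```

Printing the commands instead of executing them keeps a human review step before anything is archived.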
a
OK, the logs say that the `test` project is not active. Is this the one you archived?
Also, we can hop on a call now if it's easier for you.
g
I'm sorry, I can't do a call right now. I have archived the launch plans in `test`, `flytesnacks`, and several other projects my team created.
I monitored the scheduler pod logs after archiving all launch plans in the archived projects, and there are no scheduler errors now. The image below is the current status of the flyte namespace.
a
Also wondering how similar your problem is to this: https://github.com/flyteorg/flyte/issues/3109
g
It is the same situation the issue author describes in the link:
> When a user deletes a launch plan from code, and the user doesn't manually deactivate the launch plan, it will continue to run. There's also no way to query "old" launch plans.

However, the behavior I expect is that activated launch plans are archived when the project they belong to is archived.
a
As it's very related, I mentioned you on the issue (I hope I got the GH user right 🙂). Feel free to comment or expand further.