# flyte-support
l
Hi team. Quick question about schedules. We have a workflow with a launch plan that runs it once every 8 hours. The workflow does not take any inputs and its tasks are not cached. For some bizarre reason it has been firing twice with the same version at the same instant for several days now. Any idea what could cause this? Here is the launch plan:
{
  "id": {
    "resourceType": "LAUNCH_PLAN",
    "project": "my_project",
    "domain": "production",
    "name": "my_lp",
    "version": "20949aa7bbea5421dfef9f96d4248c7a26a6f5d5"
  },
  "spec": {
    "workflowId": {
      "resourceType": "WORKFLOW",
      "project": "my_project",
      "domain": "production",
      "name": "some_wf",
      "version": "20949aa7bbea5421dfef9f96d4248c7a26a6f5d5"
    },
    "entityMetadata": {
      "schedule": {
        "rate": {
          "value": 8,
          "unit": "HOUR"
        }
      }
    },
    "defaultInputs": {},
    "fixedInputs": {},
    "labels": {},
    "annotations": {},
    "rawOutputDataConfig": {}
  },
  "closure": {
    "state": "ACTIVE",
    "expectedInputs": {},
    "expectedOutputs": {},
    "createdAt": "2023-04-26T18:23:39.503710Z",
    "updatedAt": "2023-04-26T18:23:39.503710Z"
  }
}
f
cc @icy-agent-73298 can you take a look please
hmm this should never happen, as the execution is deduped using the execution id and the execution id is deterministically generated
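Roughly, the dedup looks like this; a minimal sketch, not the actual flyteadmin code, with executionName as a hypothetical stand-in for the helper in scheduler/identifier/identifier.go:
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// executionName derives a name purely from the launch plan key plus the
// scheduled fire time, so the same schedule tick always maps to the same
// name and admin can reject the second create request as a duplicate.
func executionName(project, domain, name, version string, scheduledAt time.Time) string {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s:%s:%s:%s:%d", project, domain, name, version, scheduledAt.Unix())
	return fmt.Sprintf("f%x", h.Sum64())
}

func main() {
	tick := time.Date(2023, 4, 26, 18, 21, 41, 0, time.UTC)
	a := executionName("my_project", "production", "my_lp", "20949aa7bbea5421dfef9f96d4248c7a26a6f5d5", tick)
	b := executionName("my_project", "production", "my_lp", "20949aa7bbea5421dfef9f96d4248c7a26a6f5d5", tick)
	fmt.Println(a == b) // true: identical inputs always yield the same name
}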
are you sure you do not have 2 launchplans?
@little-cricket-84530?
i
@little-cricket-84530 can you share the get execution output for the top two from the image?
l
So on a nightly basis we build and activate the launch plan with the latest version. So how can there be 2?
We do this using
flytectl update launchplan --admin.endpoint $(FLYTE_ADMIN_HOST):443 -p $(FLYTE_PROJECT_NAME) -d $(FLYTE_PROJECT_DOMAIN) \
$(name) --activate --version $(IMAGE_VERSION)
 ---------------------- ------------------- ------------------------------------------ ------------- ----------- ---------------------- -------------------------------- ----------------- -------------------- --------------------
| NAME                 | LAUNCH PLAN NAME  | VERSION                                  | TYPE        | PHASE     | SCHEDULED TIME       | STARTED                        | ELAPSED TIME    | ABORT DATA (TRUNC) | ERROR DATA (TRUNC) |
 ---------------------- ------------------- ------------------------------------------ ------------- ----------- ---------------------- -------------------------------- ----------------- -------------------- --------------------
| f9618b52b437ae85d000 | my_pipeline_lp    | b699495922ea41c0849e7bd03f190783475552d3 | LAUNCH_PLAN | SUCCEEDED | 2023-04-26T18:21:44Z | 2023-04-26T18:21:49.735450044Z | 3565.693426035s |                    |                    |
 ---------------------- ------------------- ------------------------------------------ ------------- ----------- ---------------------- -------------------------------- ----------------- -------------------- --------------------
 ---------------------- ------------------- ------------------------------------------ ------------- --------- ---------------------- -------------------------------- ---------------- -------------------- --------------------
| NAME                 | LAUNCH PLAN NAME  | VERSION                                  | TYPE        | PHASE   | SCHEDULED TIME       | STARTED                        | ELAPSED TIME   | ABORT DATA (TRUNC) | ERROR DATA (TRUNC) |
 ---------------------- ------------------- ------------------------------------------ ------------- --------- ---------------------- -------------------------------- ---------------- -------------------- --------------------
| f9318b52b437ae85d000 | my_pipeline_lp    | b699495922ea41c0849e7bd03f190783475552d3 | LAUNCH_PLAN | ABORTED | 2023-04-26T18:21:41Z | 2023-04-26T18:21:47.419096843Z | 170.893382172s | Terminated from UI |                    |
 ---------------------- ------------------- ------------------------------------------ ------------- --------- ---------------------- -------------------------------- ---------------- -------------------- --------------------
Aborted the 2nd run after realizing that 2 executions had fired
i
Rupsha can you share the yaml version of the output, which dumps a lot more data. Also you can add the --details flag to get node level data as well
l
@icy-agent-73298 FYI… let me know if you need anything else. One of them shows aborted because we killed it after realizing that 2 instances were running
i
Thanks @little-cricket-84530 for sharing this. Also, was this schedule working perfectly fine earlier and did it only start duplicating recently? And if yes, what changed? Also assuming the executions starting with f7 and f6 are dups of each other. I am investigating more into how this can happen.
l
It worked fine until Jan 30 for sure..
I did recently bring up the production domain.. let me go back and check if it ever worked properly there
I definitely see runs that were too close to each other
But the other pipelines with schedules are working perfectly
i
And is this for the same launchplan you shared earlier.
👍🏼 1
But the other pipelines with schedules are working perfectly
That's odd that it's just affecting this launchplan
l
I’ll check the repo to see if anything changed for this pipeline
scheduling wise
Nope.. nothing has changed since Aug last year (when the pipeline was created)
i
ok. Also can you list all launchplans with all versions in yaml format
flytectl get launchplan -p project -d domain name -o yaml
l
ok.. hang on
all_launchplans.txt
👍 2
i
Also can you help verify for me that there were no restarts of any of admin or scheduler pods while this happened
Also, another data point which could help: when did this deactivation and reactivation command run relative to the launch plan's runs? I am assuming the launch plan schedule ran, then it got deactivated for that version, and a new launch plan with a new version got scheduled and ran. This only seems to affect fixed rate schedules due to their inherent behavior of running at a fixed rate and not accepting a time arg. This behavior is tracked by the bug here https://github.com/flyteorg/flyte/issues/2885
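To illustrate the fixed rate behavior from that issue, a minimal sketch; the anchoring behavior is an assumption based on the issue, and nextFixedRateFire is a hypothetical helper, not the actual scheduler code:
package main

import (
	"fmt"
	"time"
)

// A cron schedule computes absolute fire times, so any restart or replica
// agrees on them. A fixed rate schedule, per the issue above, is anchored
// on whenever the job was (re)added to the scheduler's in-memory store and
// does not take the last fire time as an input.
func nextFixedRateFire(anchoredAt time.Time, rate time.Duration) time.Time {
	return anchoredAt.Add(rate)
}

func main() {
	rate := 8 * time.Hour
	// e.g. the new lp version gets activated by the nightly job at 10:21:33
	anchor := time.Date(2023, 4, 26, 10, 21, 33, 0, time.UTC)
	// next fire is 18:21:33Z, not aligned to the previous version's cadence
	fmt.Println(nextFixedRateFire(anchor, rate))
}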
l
ours was a fixed rate schedule
we usually have a job that runs at 3 am that checks out master, builds and deploys the images and activates the new version’s launch plan
Both admin and scheduler pods have definitely restarted in the last 30 days
what should I be updating to get the fix in? I see in our helm chart that we are on flyte-core-v0.1.10
i
Synced offline with Rupsha on the images used for the chart; those are from 2022 Jun but they should work too. From the code I am unable to see how the jobStore map could be storing two fixed rate schedules for the same version of the lp https://github.com/flyteorg/flyteadmin/blob/master/scheduler/core/gocron_scheduler.go#L117 https://github.com/flyteorg/flyteadmin/blob/master/scheduler/core/gocron_scheduler.go#L115 The name of the schedule is generated from the lp name, project, domain and version, and that's what's stored in the in-memory jobStore. And from the execution details you posted, they are for my_pipeline_lp with version b699495922ea41c0849e7bd03f190783475552d3, so ideally there should only be one identifier generated and the lookup should return it for the next ScheduleJob. @little-cricket-84530 I would need logs to debug this issue. Scheduler and admin logs for that day should be good, since I can look at when this version b699495922ea41c0849e7bd03f190783475552d3 got added, which seems to be at
2023-04-26T10:21:33.678573Z
The scheduledAt times are 3 sec apart:
2023-04-26T18:21:44Z
2023-04-26T18:21:41Z
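For reference, the dedup that jobStore lookup is expected to provide looks roughly like this; an assumed sketch mirroring the shape of the gocron_scheduler.go lines above, not a copy of them:
package main

import (
	"fmt"
	"sync"
)

// jobStore maps the schedule name (derived from project/domain/name/version)
// to the registered job. In this sketch LoadOrStore makes check-and-add
// atomic; a separate "check, then add" done as two steps is where a race
// between concurrent callers could register the same schedule twice.
var jobStore sync.Map

func scheduleJob(scheduleName string) bool {
	_, alreadyThere := jobStore.LoadOrStore(scheduleName, struct{}{})
	return !alreadyThere // true only for the first caller
}

func main() {
	fmt.Println(scheduleJob("17193008309657693892")) // true: schedule added
	fmt.Println(scheduleJob("17193008309657693892")) // false: duplicate ignored
}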
m
@icy-agent-73298 DMd you
i
Thanks Aleksei.
Couldn't find much from the scheduler logs since a lot of info level logs are missing. Would be great if we can check the flyte-scheduler-config, update the logger, and restart the scheduler:
logger.yaml: |
    logger:
      level: 6
I also wrote a unit test to check the above params, like this
package identifier

import (
	"context"
	"testing"
	"time"

	"github.com/flyteorg/flyteidl/gen/pb-go/flyteidl/admin"
	adminModels "github.com/flyteorg/flyteadmin/pkg/repositories/models"
	"github.com/flyteorg/flyteadmin/scheduler/repositories/models"
	"github.com/stretchr/testify/assert"
)

func TestGetScheduleName(t *testing.T) {
	ctx := context.Background()
	schedule := models.SchedulableEntity{
		BaseModel: adminModels.BaseModel{
			ID:        2,
			UpdatedAt: time.Now(),
		},
		SchedulableEntityKey: models.SchedulableEntityKey{
			Project: "navigation",
			Domain:  "production",
			Name:    "daily_pipeline_8hr_lp",
			Version: "b699495922ea41c0849e7bd03f190783475552d3",
		},
		// fixed rate of 8 hours, matching the schedule on the lp
		FixedRateValue:      8,
		Unit:                admin.FixedRateUnit_HOUR,
		KickoffTimeInputArg: "kickoff_time",
	}
	str := GetScheduleName(ctx, schedule)
	assert.Equal(t, "17193008309657693892", str)
}
And GetScheduleName is only dependent on these:
Project: "navigation",
			Domain:  "production",
			Name:    "daily_pipeline_8hr_lp",
			Version: "b699495922ea41c0849e7bd03f190783475552d3",
https://github.com/flyteorg/flyteadmin/blob/master/scheduler/identifier/identifier.go#L26 From the previous conversations, for these execution IDs f9618b52b437ae85d000 and f9318b52b437ae85d000 the launchplan params needed to generate the schedule name are the same, so this line should tell if one has already been added https://github.com/flyteorg/flyteadmin/blob/master/scheduler/core/gocron_scheduler.go#L117 and the schedule gets added at this line https://github.com/flyteorg/flyteadmin/blob/master/scheduler/core/gocron_scheduler.go#L139 But for some weird reason we are seeing L139 being executed twice for the same launchplan params. Suspecting some race if that's the case, but it would still be great to pinpoint this exact issue from the logs. Having the scheduler logs at debug level would help a lot
I think this is extremely possible with the current outstanding issue of fixed rate schedules not accepting a timestamp to run, combined with multiple replicas of the scheduler. Can you help check if you are running 2 replicas of the scheduler? This is the outstanding issue which I had posted earlier, due to which this can happen https://github.com/flyteorg/flyte/issues/2885 Each scheduler picks up the fixed rate schedule, and since they don't respect the last time the execution ran as an input, each creates a new start time for the schedule. That yields two different execution IDs, and both would be accepted by admin since they are unique. The uniqueness of the execution identifier comes from appending the schedule time https://github.com/flyteorg/flyteadmin/blob/master/scheduler/identifier/identifier.go#L36 which is what differs when we have multiple replicas of the scheduler running. This only affects fixed rate schedules, not cron schedules
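A small sketch of that failure mode; the shapes are assumed, and executionID is a hypothetical stand-in for the real derivation at identifier.go#L36:
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// executionID folds the scheduled fire time into the hash, so two replicas
// that anchor the same fixed rate schedule a few seconds apart produce two
// distinct IDs, and admin accepts both executions as unique.
func executionID(lpVersion string, scheduledAt time.Time) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s:%s", lpVersion, scheduledAt.Format(time.RFC3339))
	return h.Sum64()
}

func main() {
	version := "b699495922ea41c0849e7bd03f190783475552d3"
	// the two scheduled times from the executions above, 3 seconds apart
	replicaA := executionID(version, time.Date(2023, 4, 26, 18, 21, 41, 0, time.UTC))
	replicaB := executionID(version, time.Date(2023, 4, 26, 18, 21, 44, 0, time.UTC))
	fmt.Println(replicaA != replicaB) // true: duplicate runs, each with a unique ID
}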
m
We have 2 replicas of the scheduler
We'll scale it down to 1
i
cool that explains it then. Let us know if you face any issues again. Will try to prioritise that fix for the next release.
m
that solved it for now
thanks for the support!
👍 1