I m also seeing several workflow executions abort with the e Flyte #flyte-support

lively-engineer-47231

03/10/2023, 10:57 PM

I’m also seeing several workflow executions abort with the error

Workflow execution not found in flyteadmin.

which I don’t fully understand.

freezing-airport-6809

03/10/2023, 11:07 PM

When do they abort

freezing-airport-6809

03/10/2023, 11:07 PM

Your db seems off

freezing-airport-6809

03/10/2023, 11:07 PM

What postgres are you using

wooden-airline-96737

03/10/2023, 11:08 PM

RDS 14.5

wooden-airline-96737

03/10/2023, 11:09 PM

Looking at flyteadmin logs, the following statement returns 0 rows, then it goes on a killing spree, and 15 seconds later the row is back…

[0.752ms] [rows:1] SELECT * FROM "executions" WHERE "executions"."execution_project" = 'flyte-dev' AND "executions"."execution_domain" = 'production' AND "executions"."execution_name" = 're123' LIMIT 1

wooden-airline-96737

03/10/2023, 11:15 PM

[7.721ms] [rows:0] SELECT * FROM "executions" WHERE "executions"."execution_project" = 'flyte-dev' AND "executions"."execution_domain" = 'production' AND "executions"."execution_name" = 're123' LIMIT 1

Failed to find existing execution with id [project:"flyte-dev" domain:"production" name:"re123" ] with err: missing entity of type execution with identifier project:"flyte-dev" domain:"production" name:"re123" 

Failed to record task event [task_id:<resource_type:TASK project:"flyte-dev" domain:"production"....

Then we end up with KillTask invoked and Deletion Triggered for re123. But then, 4 minutes later, it’s back:

[0.752ms] [rows:1] SELECT * FROM "executions" WHERE "executions"."execution_project" = 'flyte-dev' AND "executions"."execution_domain" = 'production' AND "executions"."execution_name" = 're123' LIMIT
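
(Editor's sketch, not part of the original thread.) To see exactly when the lookup for one execution flips between 1 and 0 rows, the query lines can be pulled out of the flyteadmin logs with a small script. This is a minimal sketch that assumes the log lines look like the ones quoted above: a duration in `ms`, a `[rows:N]` field, then the `SELECT`, possibly with terminal colour codes mixed in. The function names `parse_query_line` and `missing_lookups` are made up for illustration.

```python
import re

# Terminal colour codes such as "[0m", "[33m" or "[34;1m" (with or without
# the leading ESC byte) that show up between the fields of the log line.
COLOR = re.compile(r"\x1b?\[[0-9;]*m")

# "[7.721ms] [rows:0] SELECT ..." once the colour codes are removed.
QUERY = re.compile(
    r"\[(?P<ms>[0-9.]+)ms\]\s*\[rows:(?P<rows>\d+)\]\s*(?P<sql>SELECT .*)"
)


def parse_query_line(line):
    """Return (duration_ms, rows, sql) for a query log line, else None."""
    match = QUERY.search(COLOR.sub("", line))
    if match is None:
        return None
    return float(match["ms"]), int(match["rows"]), match["sql"].strip()


def missing_lookups(lines, execution_name):
    """Collect lookups for execution_name that returned zero rows."""
    needle = f"\"execution_name\" = '{execution_name}'"
    hits = []
    for line in lines:
        parsed = parse_query_line(line)
        if parsed and parsed[1] == 0 and needle in parsed[2]:
            hits.append(parsed)
    return hits
```

Feeding it the saved admin log, for example `missing_lookups(open("admin.log"), "re123")` (the file name is hypothetical), lists only the lookups that came back empty, which can then be lined up against the KillTask timestamps.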

wooden-airline-96737

03/10/2023, 11:20 PM

Flyte deployment is pretty stock via helm chart, with dedicated stock rds instance with logical_replication enabled and no redis in the mix.

freezing-airport-6809

03/10/2023, 11:24 PM

That doesn’t sound like RDS, right?

freezing-airport-6809

03/10/2023, 11:24 PM

That sounds like a database correctness issue

freezing-airport-6809

03/10/2023, 11:25 PM

We have never seen this

freezing-airport-6809

03/10/2023, 11:27 PM

Don’t you think something is odd

glamorous-carpet-83516

03/10/2023, 11:30 PM

do all the tasks get the same error or only this task?

lively-engineer-47231

03/10/2023, 11:32 PM

I’ll see tasks output

Some node execution failed, auto-abort.

and the workflow also states

Workflow execution not found in flyteadmin.

lively-engineer-47231

03/10/2023, 11:33 PM

typically, all the tasks abort with the same auto-abort error

wooden-airline-96737

03/10/2023, 11:39 PM

I’m happy to spin up a new db to test this out. Docs suggest Aurora with Postgres compatibility. Is that what you’d recommend?

freezing-airport-6809

03/10/2023, 11:40 PM

Ya, we have used Aurora a lot and it works great

wooden-airline-96737

03/10/2023, 11:42 PM

Sounds good I’ll spin up 14.6 and move the dbs over

freezing-airport-6809

03/11/2023, 12:20 AM

cool, please let us know if you see this behavior

thankful-minister-83577

03/11/2023, 12:23 AM

can you search for this line in your logs?

The node execution launched an execution but it does not exist

freezing-airport-6809

03/11/2023, 12:23 AM

@lively-engineer-47231 / @wooden-airline-96737 Wait - @thankful-minister-83577 thinks this is a red herring

thankful-minister-83577

03/11/2023, 12:23 AM

(this would be admin logs if you’re running them as separate services)

lively-engineer-47231

03/11/2023, 12:28 AM

I don’t see that line, but I do see /go/src/github.com/flyteorg/flyteadmin/pkg/repositories/gormimpl/task_execution_repo.go:55 record not found
in the admin logs

wooden-airline-96737

03/11/2023, 1:02 AM

One thing I missed while focusing on the logs was that our propeller was OOMing. Upped the memory limit and we got a step further, I believe.

average-finland-92144

03/13/2023, 4:05 PM

@wooden-airline-96737 / @lively-engineer-47231 what's the current status?

wooden-airline-96737

03/13/2023, 7:10 PM

Stuck on:

Attempt 01
aborted
Some node execution failed, auto-abort.

Can’t find anything useful. Kube logs show the pod getting terminated, nothing out of the ordinary, except that the pod is then gone vs left in Successful state. Propeller logs show nothing other than handling the Abort event…

wooden-airline-96737

03/13/2023, 7:12 PM

And this happens on random nodes relaunching the same workflow

wooden-airline-96737

03/13/2023, 8:50 PM

Welp, running two Flyte installs in different namespaces was the problem. All problems gone by running just one Flyte install cluster-wide.
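
(Editor's sketch, not part of the original thread.) For anyone hitting the same symptom, a quick way to check for a second install is to look for the same Flyte component deployed in more than one namespace. The snippet below is a rough sketch that reads the JSON printed by `kubectl get deployments --all-namespaces -o json`. The deployment names `flytepropeller` and `flyteadmin` are assumptions about how the chart names things and may differ in your install; `find_duplicate_installs` is a made-up helper name.

```python
import json
from collections import defaultdict


def find_duplicate_installs(deployments_json,
                            component_names=("flytepropeller", "flyteadmin")):
    """Map each component name to its namespaces if it runs in more than one.

    deployments_json is the text output of:
        kubectl get deployments --all-namespaces -o json
    component_names are assumed deployment names; adjust for your chart.
    """
    items = json.loads(deployments_json).get("items", [])
    namespaces = defaultdict(set)
    for item in items:
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        if name in component_names:
            namespaces[name].add(meta.get("namespace", ""))
    # Keep only components that appear in two or more namespaces.
    return {name: sorted(ns) for name, ns in namespaces.items() if len(ns) > 1}
```

An empty result means each listed component was found in at most one namespace. A non-empty result only flags a candidate for the problem described above; it does not prove the two installs are interfering.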

🙌 2
