I’m also seeing several workflow executions abort ...
# ask-the-community
g
I’m also seeing several workflow executions abort with the error
Workflow execution not found in flyteadmin.
which I don’t fully understand.
k
When do they abort
Your db seems off
What postgres are you using
v
RDS 14.5
Looking at flyteadmin logs, the following statement returns 1 row, then the same statement returns 0 rows, then it goes on a killing spree, and some time later the row is back…
[0.752ms] [rows:1] SELECT * FROM "executions" WHERE "executions"."execution_project" = 'flyte-dev' AND "executions"."execution_domain" = 'production' AND "executions"."execution_name" = 're123' LIMIT 1
[7.721ms] [rows:0] SELECT * FROM "executions" WHERE "executions"."execution_project" = 'flyte-dev' AND "executions"."execution_domain" = 'production' AND "executions"."execution_name" = 're123' LIMIT 1

Failed to find existing execution with id [project:"flyte-dev" domain:"production" name:"re123" ] with err: missing entity of type execution with identifier project:"flyte-dev" domain:"production" name:"re123" 

Failed to record task event [task_id:<resource_type:TASK project:"flyte-dev" domain:"production"....
Then we end up with KillTask invoked and Deletion Triggered for re123. But then, 4 minutes later, it’s back:
[0.752ms] [rows:1] SELECT * FROM "executions" WHERE "executions"."execution_project" = 'flyte-dev' AND "executions"."execution_domain" = 'production' AND "executions"."execution_name" = 're123' LIMIT
The Flyte deployment is pretty stock via the helm chart, with a dedicated stock RDS instance with logical_replication enabled and no Redis in the mix.
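For anyone wanting to hunt for the same pattern in their own flyteadmin logs, here is a minimal sketch (not from the thread) that flags queries whose row count changes between log lines. It assumes the query log lines look like the ones pasted above, optionally with ANSI color codes still attached; the function name `flapping_queries` is made up for illustration.

```python
import re
from collections import defaultdict

# Strips ANSI color sequences such as ESC[33m or ESC[34;1m.
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# Matches lines shaped like "[0.752ms] [rows:1] SELECT ..." (assumed format).
LINE_RE = re.compile(r"\[[\d.]+ms\]\s+\[rows:(?P<rows>\d+)\]\s+(?P<sql>SELECT .+)")


def flapping_queries(lines):
    """Return {sql: [row counts in log order]} for queries whose count varies."""
    seen = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(ANSI_RE.sub("", line))
        if match:
            seen[match.group("sql").strip()].append(int(match.group("rows")))
    # Keep only statements that returned different row counts over time.
    return {sql: counts for sql, counts in seen.items() if len(set(counts)) > 1}
```

Feeding it the lines of a saved log file returns only the statements that came back with different row counts, which is the symptom described here.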
k
That doesn’t sound like RDS, right?
That sounds like a database correctness issue
We have never seen this
Don’t you think something is odd
k
do all the tasks get the same error or only this task?
g
I’ll see tasks output
Some node execution failed, auto-abort.
and the workflow also states
Workflow execution not found in flyteadmin.
typically, all the tasks abort with the same auto-abort error
v
I’m happy to spin up a new db to test this out. Docs suggest Aurora with Postgres compatibility. Is that what you’d recommend?
k
Ya, we have used Aurora a lot and it works great
v
Sounds good, I’ll spin up 14.6 and move the DBs over
k
cool, please let us know if you see this behavior
y
can you search for this line in your logs?
The node execution launched an execution but it does not exist
k
@Gerry Meixiong / @Viljem Skornik Wait - @Yee thinks this is a red herring
y
(this would be admin logs if you’re running them as separate services)
g
I don’t see that line, but I do see
/go/src/github.com/flyteorg/flyteadmin/pkg/repositories/gormimpl/task_execution_repo.go:55 record not found
in the admin logs
v
One thing I missed while focusing on the logs was that our propeller was OOMing. I upped the memory limit and we got a step further, I believe.
d
@Viljem Skornik / @Gerry Meixiong what's the current status?
v
Stuck on:
Attempt 01
aborted
Some node execution failed, auto-abort.
Can’t find anything useful. Kube logs show the pod getting terminated, nothing out of the ordinary, except that the pod is then gone rather than left in a Successful state. Propeller logs show nothing other than handling the Abort event…
And this happens on random nodes when relaunching the same workflow.
Welp, running two Flyte installs in different namespaces was the problem. All problems went away after running just one Flyte install cluster-wide.
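A quick way to check for the same situation, as a rough sketch rather than anything official: list the namespaces that contain a propeller Deployment and see whether there is more than one. This assumes the Deployment is named `flytepropeller` (adjust if your chart or release names it differently) and that the input is the JSON printed by `kubectl get deployments --all-namespaces -o json`; the helper name `propeller_namespaces` is hypothetical.

```python
import json


def propeller_namespaces(deployments_json, name="flytepropeller"):
    """Return the sorted namespaces holding a Deployment called `name`.

    `deployments_json` is the JSON text from
    `kubectl get deployments --all-namespaces -o json`.
    The default Deployment name is an assumption about the install.
    """
    items = json.loads(deployments_json).get("items", [])
    return sorted(
        item["metadata"]["namespace"]
        for item in items
        if item["metadata"]["name"] == name
    )
```

If the returned list has more than one entry, there is more than one Flyte install on the cluster, which is the setup that caused the aborts in this thread.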