# ask-the-community
c
hello again - i have deployed flyte to AKS and have a very simple workflow: a single task that returns a string. i am using a recent artifact from master because i integrated with Azure blob storage. when i run the workflow via `pyflyte`, the task pod completes successfully and i see metadata written to my blob store; the `n0` dir has both input and output, and a dir is created for `end-node` which contains `inputs.pb`. however, the execution never finishes. it remains in the `RUNNING` state and the `end-node` task/event lists as `QUEUED`
when i look at execution details:
└── start-node - SUCCEEDED - 2023-10-12 21:27:35.186157884 +0000 UTC - 2023-10-12 21:27:35.186197485 +0000 UTC
└── n0 - SUCCEEDED - 2023-10-12 21:27:35.199313006 +0000 UTC - 2023-10-12 21:27:40.65606235 +0000 UTC
│   ├── Attempt :0
│   │   ├── Task - SUCCEEDED - 2023-10-12 21:27:35.262872565 +0000 UTC - 2023-10-12 21:27:40.644027555 +0000 UTC
│   │   ├── Task Type - python-task
│   │   ├── Reason - [ContainersNotReady|ContainerCreating]: containers with unready status: [f81959f07bf634b93b9c-n0-0]|
│   │   ├── Metadata
│   │   │   ├── Generated Name : f81959f07bf634b93b9c-n0-0
│   │   │   ├── Plugin Identifier : container
│   │   │   ├── External Resources
│   │   │   ├── Resource Pool Info
│   │   ├── Logs :
│   ├── Outputs :
│       └── o0: test-response
└── end-node - QUEUED - 2023-10-12 21:27:40.69112181 +0000 UTC - 2023-10-12 21:27:40.712353931 +0000 UTC
i don't see any error in the task pod but i do see this error in the flyte pod:
E1012 21:27:40.723260    7 workers.go:102] error syncing 'flyte-az-development/f81959f07bf634b93b9c': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "f81959f07bf634b93b9c": the object has been modified; please apply your changes to the latest version and try again
i suspect i have something misconfigured but can't figure out what that is. i have tried both with an azure workload identity and raw creds and see the same behavior. any ideas?
if i add a second simple task, say to concat the original string with a new string, the second task remains in the `QUEUED` state and never runs
i also sometimes see `Workflow execution not found in flyteadmin.` as an error in the task pod when i try to re-run the workflow. i don't know if that is related or just a transient error
i should also point out that i'm using `flyte-binary` and an image i built using the `build_native_flyte` Makefile target. the custom artifact doesn't have a code change but references a custom stow build
s
@Yee, any idea how to debug these issues?
y
can you search propeller logs?
or rather, flyte logs, but coming from the propeller component
this feels like a propeller misconfiguration.
c
i'll see if i can get more logs today. right now the only thing i see in the flyte pod log is the `E1012` above and stuff like:
2023/10/13 16:00:21 /go/pkg/mod/gorm.io/gorm@v1.24.1-0.20221019064659-5dd2bb482755/finisher_api.go:509
[0.722ms] [rows:1] SELECT count(*) FROM pg_indexes WHERE tablename = 'artifacts' AND indexname = 'artifacts_dataset_uuid_idx' AND schemaname = CURRENT_SCHEMA()
...
feels like i must be missing something
@Yee i played around with logging levels and don't see anything interesting, but might be missing something. is there a separate propeller logging configuration? I have this in my `values.yaml`:
# logging Specify configuration for logs emitted by Flyte
  logging:
    # level Set the log level
    level: 1
and this in my `.flyte/config.yaml` file:
logger:
  show-source: true
  ### I have tried a variety of different levels
  level: 20
and this is my only propeller config (in `values.yaml`):
propeller:
    # disabled Disables flytepropeller
    disabled: false
    # disableWebhook Disables webhook only
    disableWebhook: false
    publish-k8s-events: true
    create-flyteworkflow-crd: true
y
use logging level 5.
it needs to be 5, not 6 or higher, unfortunately.
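for reference, a minimal sketch of what that looks like in the flyte-binary Helm values, following the `logging` snippet earlier in the thread. the `configuration:` parent key is an assumption about the chart layout, so verify it against your chart version's `values.yaml`:

```yaml
# Hedged sketch: the `configuration:` parent is an assumption based on the
# flyte-binary chart layout; check it against your own values.yaml.
configuration:
  logging:
    # 5 = debug. Per the advice above, values above 5 are not
    # recognized, so use exactly 5 for maximum verbosity.
    level: 5
```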
c
that was helpful - i had the log level misconfigured. i now see a bunch of logs, but still can't explain the behavior. the workflow was clearly accessible (in postgres) earlier in the execution, so it's unclear what causes the failure after n0 succeeds. the `node_execution_events` show a normal flow, but `end-node` remains `QUEUED`. (i think) this is the likely pertinent block of the log leading up to propeller trying to pick up that event:
[Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"f42f996b920d8459b9fe\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2023-10-18T20:22:10Z"}
2023/10/18 20:22:10 /go/pkg/mod/gorm.io/gorm@v1.24.1-0.20221019064659-5dd2bb482755/callbacks.go:134
[3.650ms] [rows:1] INSERT INTO "node_executions" ("created_at","updated_at","deleted_at","execution_project","execution_domain","execution_name","node_id","phase","input_uri","closure","started_at","node_execution_created_at","node_execution_updated_at","duration","node_execution_metadata","parent_id","parent_task_execution_id","error_kind","error_code","cache_status","dynamic_workflow_remote_closure_reference","internal_data") VALUES ('2023-10-18 20:22:10.162','2023-10-18 20:22:10.162',NULL,'flyte-az','development','f42f996b920d8459b9fe','end-node','QUEUED','abfs://myflytetest/metadata/propeller/flyte-az-development-f42f996b920d8459b9fe/end-node/data/inputs.pb','<binary>',NULL,'2023-10-18 20:22:10.135','2023-10-18 20:22:10.16',0,'<binary>',NULL,NULL,NULL,NULL,NULL,'','<binary>') RETURNING "id"
{"json":{"exec_id":"f42f996b920d8459b9fe","node":"end-node","src":"noop_notifications.go:32"},"level":"debug","msg":"call to noop publish with notification type [flyteidl.admin.NodeExecutionEventRequest] and proto message [event:\u003cid:\u003cnode_id:\"end-node\" execution_id:\u003cproject:\"flyte-az\" domain:\"development\" name:\"f42f996b920d8459b9fe\" \u003e \u003e producer_id:\"propeller\" phase:QUEUED occurred_at:\u003cseconds:1697660530 nanos:135833222 \u003e input_uri:\"abfs://myflytetest/metadata/propeller/flyte-az-development-f42f996b920d8459b9fe/end-node/data/inputs.pb\" spec_node_id:\"end-node\" event_version:1 reported_at:\u003cseconds:1697660530 nanos:160164218 \u003e \u003e ]","ts":"2023-10-18T20:22:10Z"}
{"json":{"exec_id":"f42f996b920d8459b9fe","node":"end-node","ns":"flyte-az-development","res_ver":"342094168","routine":"worker-11","src":"executor.go:991","wf":"flyte-az:development:workflows.simple-workflow.simple_workflow"},"level":"debug","msg":"Node pre-execute completed","ts":"2023-10-18T20:22:10Z"}
{"json":{"exec_id":"f42f996b920d8459b9fe","node":"end-node","ns":"flyte-az-development","res_ver":"342094168","routine":"worker-11","src":"executor.go:1174","wf":"flyte-az:development:workflows.simple-workflow.simple_workflow"},"level":"info","msg":"Node was queued, parallelism is now [1]","ts":"2023-10-18T20:22:10Z"}
{"json":{"exec_id":"f42f996b920d8459b9fe","node":"end-node","ns":"flyte-az-development","res_ver":"342094168","routine":"worker-11","src":"executor.go:1176","wf":"flyte-az:development:workflows.simple-workflow.simple_workflow"},"level":"debug","msg":"Completed node [end-node]","ts":"2023-10-18T20:22:10Z"}
{"json":{"exec_id":"f42f996b920d8459b9fe","ns":"flyte-az-development","res_ver":"342094168","routine":"worker-11","src":"executor.go:403","wf":"flyte-az:development:workflows.simple-workflow.simple_workflow"},"level":"info","msg":"Handling Workflow [f42f996b920d8459b9fe] Done","ts":"2023-10-18T20:22:10Z"}
{"json":{"exec_id":"f42f996b920d8459b9fe","ns":"flyte-az-development","routine":"worker-11","src":"passthrough.go:89"},"level":"debug","msg":"Observed FlyteWorkflow Update (maybe finalizer)","ts":"2023-10-18T20:22:10Z"}
{"json":{"src":"controller.go:206"},"level":"info","msg":"Deletion triggered for f42f996b920d8459b9fe","ts":"2023-10-18T20:22:10Z"}

2023/10/18 20:22:10 /go/pkg/mod/gorm.io/gorm@v1.24.1-0.20221019064659-5dd2bb482755/callbacks.go:134
[4.106ms] [rows:1] INSERT INTO "node_execution_events" ("created_at","updated_at","deleted_at","execution_project","execution_domain","execution_name","node_id","request_id","occurred_at","phase") VALUES ('2023-10-18 20:22:10.166','2023-10-18 20:22:10.166',NULL,'flyte-az','development','f42f996b920d8459b9fe','end-node','','2023-10-18 20:22:10.135','QUEUED') RETURNING "id"
{"json":{"exec_id":"f42f996b920d8459b9fe","ns":"flyte-az-development","routine":"worker-11","src":"passthrough.go:104"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com \"f42f996b920d8459b9fe\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2023-10-18T20:22:10Z"}
{"json":{"exec_id":"f42f996b920d8459b9fe","ns":"flyte-az-development","routine":"worker-11","src":"handler.go:362"},"level":"info","msg":"Completed processing workflow.","ts":"2023-10-18T20:22:10Z"}
E1018 20:22:10.173073       7 workers.go:102] error syncing 'flyte-az-development/f42f996b920d8459b9fe': Operation cannot be fulfilled on flyteworkflows.flyte.lyft.com "f42f996b920d8459b9fe": the object has been modified; please apply your changes to the latest version and try again
{"json":{"exec_id":"f42f996b920d8459b9fe","ns":"flyte-az-development","routine":"worker-11","src":"handler.go:181"},"level":"info","msg":"Processing Workflow.","ts":"2023-10-18T20:22:10Z"}
{"json":{"exec_id":"f42f996b920d8459b9fe","ns":"flyte-az-development","routine":"worker-11","src":"passthrough.go:40"},"level":"warning","msg":"Workflow not found in cache.","ts":"2023-10-18T20:22:10Z"}
{"json":{"exec_id":"f42f996b920d8459b9fe","ns":"flyte-az-development","routine":"worker-11","src":"handler.go:189"},"level":"warning","msg":"Workflow namespace[flyte-az-development]/name[f42f996b920d8459b9fe] not found, may be deleted.","ts":"2023-10-18T20:22:10Z"}
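for context on the recurring error: "Operation cannot be fulfilled ... the object has been modified" is Kubernetes' standard optimistic-concurrency conflict, returned when a client updates an object using a stale `resourceVersion` because another writer got there first. the conflict is normally benign since the controller re-reads and retries on the next sync; here a second writer (which turned out to be the other flyte instance, see below) kept racing the update. a minimal sketch of the compare-and-swap pattern behind it, using a hypothetical in-memory store rather than the real Kubernetes API:

```python
# Sketch of optimistic concurrency control (the mechanism behind the
# "object has been modified" error). The Store class and all names here
# are hypothetical; in Kubernetes the version check happens server-side.

class ConflictError(Exception):
    pass

class Store:
    def __init__(self):
        self.objects = {}  # name -> (resource_version, data)

    def get(self, name):
        return self.objects[name]

    def update(self, name, expected_version, data):
        current_version, _ = self.objects[name]
        if current_version != expected_version:
            # Another writer won the race: the caller must re-read and
            # reapply its change against the latest version.
            raise ConflictError("the object has been modified; please apply "
                                "your changes to the latest version and try again")
        self.objects[name] = (current_version + 1, data)

def update_with_retry(store, name, mutate, retries=3):
    # Retry-on-conflict loop, analogous to a controller retrying on resync.
    for _ in range(retries):
        version, data = store.get(name)
        try:
            store.update(name, version, mutate(data))
            return True
        except ConflictError:
            continue
    return False

store = Store()
store.objects["wf"] = (1, {"phase": "RUNNING"})

first_attempt = [True]
def set_succeeded(data):
    # Simulate a concurrent writer bumping the resource version between
    # our read and our write, but only on the first attempt.
    if first_attempt[0]:
        first_attempt[0] = False
        version, current = store.objects["wf"]
        store.objects["wf"] = (version + 1, current)
    return {**data, "phase": "SUCCEEDED"}

# First attempt conflicts, second attempt succeeds against version 2.
ok = update_with_retry(store, "wf", set_succeeded)
```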
output when querying postgres:
select * from node_execution_events where execution_name = 'f42f996b920d8459b9fe';
 id |          created_at           |          updated_at           | deleted_at | execution_project | execution_domain |    execution_name    |  node_id   | request_id |          occurred_at          |   phase   
----+-------------------------------+-------------------------------+------------+-------------------+------------------+----------------------+------------+------------+-------------------------------+-----------
 28 | 2023-10-18 20:22:04.307004+00 | 2023-10-18 20:22:04.307004+00 |            | flyte-az          | development      | f42f996b920d8459b9fe | start-node |            | 2023-10-18 20:22:04.299654+00 | SUCCEEDED
 29 | 2023-10-18 20:22:04.344218+00 | 2023-10-18 20:22:04.344218+00 |            | flyte-az          | development      | f42f996b920d8459b9fe | n0         |            | 2023-10-18 20:22:04.328458+00 | QUEUED
 30 | 2023-10-18 20:22:04.480383+00 | 2023-10-18 20:22:04.480383+00 |            | flyte-az          | development      | f42f996b920d8459b9fe | n0         |            | 2023-10-18 20:22:04.452153+00 | RUNNING
 31 | 2023-10-18 20:22:10.108206+00 | 2023-10-18 20:22:10.108206+00 |            | flyte-az          | development      | f42f996b920d8459b9fe | n0         |            | 2023-10-18 20:22:10.089884+00 | SUCCEEDED
 32 | 2023-10-18 20:22:10.166419+00 | 2023-10-18 20:22:10.166419+00 |            | flyte-az          | development      | f42f996b920d8459b9fe | end-node   |            | 2023-10-18 20:22:10.135833+00 | QUEUED
let me know if specific logs or psql queries will help diagnose the problem
i have similar logs for the sample `wf` in the tutorial
i should note that i have two SAs, one for flyte and one for tasks. they seem to work as expected, given the data is saved to the input/output blobs, but there might be something funky there since i created the accounts manually
ok, i think this problem is solved. it turned out to be related to two instances of flyte in the same cluster. they lived in different namespaces and seemed to have secure boundaries, but i noticed logs in my deployment citing the other namespace. after deleting the unused instance and its namespace, `pyflyte` runs the workflow as expected
y
thanks for figuring it out.
i was at a loss
c
me too, hah! sorry for the run-around. are multiple instances in the same cluster technically feasible? if so, i might dig in a little deeper to understand what was wrong in our setup. if it's not recommended (or not possible), then i'll happily move on, heh
y
maybe technically feasible but definitely an untested path. multiple instances of propeller are common enough, since there is propeller sharding out of the box, but not multiple whole flyte deployments.
c
(and thanks for helping with log configuration. that was the key for my debugging)