Can anyone help me debug why my task is stuck in Q...
# flyte-support
a
Can anyone help me debug why my task is stuck in Queueing state? When I look at the status of the pods it says they have Completed - including the pod corresponding to the task that is stuck Queueing in the dashboard - but the next tasks do not kick off.
Looking at the logs for the flytepropeller pod, I'm seeing a lot of warnings about a "futures file" not existing...?
t
this is a bit strange. can you take a look at the logs for the parent pod?
so a dynamic task will first kick off a pod for the dynamic task, but instead of returning values, it returns a subworkflow instead.
this is the job of the first pod that it launches.
that workflow is a DynamicJobSpec object but basically it looks just like a workflow
that’s what’s supposed to be put into the futures.pb file.
feels like there’s a bug somewhere. if the file didn’t generate i don’t think propeller should be looking for it
can you check the contents of that directory? can you s3 ls that folder where the futures file is supposed to be?
f
What version are you running
a
@thankful-minister-83577 Using kubectl logs on the parent pod does not show anything out of the ordinary. Log contents attached. I am using minio as the storage. Attached is the screenshot of minio - it definitely creates that futures.pb file at some point after it complains. Also attached is the full output of the flytepropeller pod logs - including the futures file complaint. Thanks for taking a look at this
@freezing-airport-6809 For flytekit? 1.2.6 For flytectl? 0.6.25
f
Ohh backend is older
But we have not seen this error
a
It still says "Queued" on the dashboard even after the pipeline aborts due to "max number of system retry attempts [31/30] exhausted"
f
@ambitious-australia-27749 can you share a reproducible copy of the workflow that I can try tomorrow?
a
Yes! I will get it to you in a few hours
g
@hallowed-mouse-14616 I think it’s probably related to the issue we discussed. Failed to create the resource, but the job status is running forever.
t
logger:
      show-source: true
      level: 6
actually @ambitious-australia-27749 can you re-run again now that the logging level has been bumped
and then send over those logs?
the old futures file is fine, i don’t think that’s changing
a
@thankful-minister-83577 Just reran the workflow and attached the logs from flyte propeller with log level 6. I also attached both futures files - log level 6 and regular log level (just in case they aren't the same)
(delaieine is the name of our project fysa - just so it's easier to keep track of)
t
@hallowed-mouse-14616 over yonder here
h
@ambitious-australia-27749 so parsing through the logs here there may be multiple issues we need to address: (1) For the workflow you linked ^^ (i.e. aq4dj4xctvd84df9cvqm) the error message in the logs is:
{
  "json": {
    "exec_id": "aq4dj4xctvd84df9cvqm",
    "node": "n5/dn0",
    "ns": "delaieine-development",
    "res_ver": "266883",
    "routine": "worker-3",
    "wf": "delaieine:development:flyte.workflows.auto_train.pipeline"
  },
  "level": "error",
  "msg": "handling parent node failed with error: InvalidArgument: Invalid fields for event message, caused by [rpc error: code = InvalidArgument desc = missing project]",
  "ts": "2023-01-19T01:26:46Z"
}
This shows that propeller is failing to send a message to admin because of a 'missing project'. There may be some kind of version mismatch between propeller and admin - do you know what versions you're running?
(2) The "Failed to read futures file" errors are printed out for other workflows (i.e. not the one depicted). It looks like Flyte is trying to abort the workflow but is failing to abort. ex:
{
  "json": {
    "exec_id": "a8h2qqfdkxtqzhg49g22",
    "node": "n5",
    "ns": "delaieine-development",
    "res_ver": "271443",
    "routine": "worker-1",
    "wf": "delaieine:development:flyte.workflows.auto_train.pipeline"
  },
  "level": "warning",
  "msg": "Failed to read futures file. Error: path:<s3://my-s3-bucket/metadata/propeller/delaieine-development-a8h2qqfdkxtqzhg49g22/n5/data/0/futures.pb>: not found",
  "ts": "2023-01-19T01:50:13Z"
}
followed by:
{
  "json": {
    "exec_id": "a8h2qqfdkxtqzhg49g22",
    "ns": "delaieine-development",
    "res_ver": "271443",
    "routine": "worker-1",
    "wf": "delaieine:development:flyte.workflows.auto_train.pipeline"
  },
  "level": "error",
  "msg": "Failed to propagate Abort for workflow:project:\"delaieine\" domain:\"development\" name:\"a8h2qqfdkxtqzhg49g22\" . Error: []",
  "ts": "2023-01-19T01:50:13Z"
}
Somehow the futures.pb file is missing. So either (1) it was generated and deleted, corrupt, etc. or (2) Flyte is looking for the file when it shouldn't be - this may be related to the event issue above.
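A minimal sketch of how the triage above could be repeated on a saved propeller log, assuming the log was captured with one JSON object per line (the entries quoted in this thread are pretty-printed for readability). The function names `triage` and `futures_location` are hypothetical helpers, not part of Flyte; only the field names (`json.exec_id`, `json.node`, `level`, `msg`, `ts`) come from the log entries shown above.

```python
import json
import re
from collections import defaultdict
from urllib.parse import urlparse

# Matches the "path:<s3://...>" fragment seen in the futures-file warning.
FUTURES_RE = re.compile(r"path:<(s3://[^>]+)>")


def triage(lines):
    """Group warning/error entries by execution ID.

    Assumes one JSON object per line; lines that are not JSON objects
    are skipped.
    """
    by_exec = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if entry.get("level") not in ("warning", "error"):
            continue
        ctx = entry.get("json") or {}
        by_exec[ctx.get("exec_id", "unknown")].append(
            (entry.get("ts"), entry["level"], ctx.get("node"), entry.get("msg", ""))
        )
    return by_exec


def futures_location(msg):
    """Return (bucket, key) for the futures file named in a message, or None."""
    match = FUTURES_RE.search(msg)
    if match is None:
        return None
    parsed = urlparse(match.group(1))
    return parsed.netloc, parsed.path.lstrip("/")
```

Calling `triage(open("propeller.log"))` (the file name is an assumption) gives a per-execution list of warnings and errors, which makes it easier to see that the "missing project" error and the futures-file warnings belong to different executions. `futures_location` returns the bucket and key to check in MinIO.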
a
Hi @hallowed-mouse-14616! Thanks for taking a look at this. I saw that "missing project" error. It's quite odd considering I definitely have a project created. I'm using: flytekit=1.2.6 flytectl=0.6.25. I'm not sure how to determine which exact version of flytepropeller I have.
For (2) When walking through with Yee we definitely saw it creating and keeping the futures.pb file and we used flyte-cli parse-proto to parse it and it appeared to be fine. Very strange! An example of the futures.pb file created when running this workflow is here: https://flyte-org.slack.com/archives/CP2HDHKE1/p1674254736681989?thread_ts=1674085228.291519&cid=CP2HDHKE1
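On the open question of which flytepropeller version is deployed, one common approach is to read the image tag from the Kubernetes deployment. The sketch below assumes the deployment is named `flytepropeller` and lives in a namespace called `flyte`; both are assumptions and may differ per install, as are the helper names. It also assumes `kubectl` is on the PATH and pointed at the right cluster.

```python
import subprocess


def image_tag(image):
    """Return the tag of a container image reference, or None if untagged."""
    name = image.split("@", 1)[0]      # drop any @sha256:... digest
    last = name.rsplit("/", 1)[-1]     # ignore a registry host:port prefix
    if ":" in last:
        return last.split(":", 1)[1]
    return None


def deployment_images(deployment="flytepropeller", namespace="flyte"):
    """List container images of a deployment via kubectl.

    The default deployment and namespace names are assumptions.
    """
    out = subprocess.run(
        [
            "kubectl", "get", "deployment", deployment,
            "-n", namespace,
            "-o", "jsonpath={.spec.template.spec.containers[*].image}",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.split()
```

Running `[image_tag(i) for i in deployment_images()]` for both the propeller and admin deployments would show whether the two backend components are on matching releases.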
@hallowed-mouse-14616 Hey Dan! Sorry for not following up on this until now - My team and I had a bunch of end-of-the-year performance deliverables that took up most of February. We never solved this issue with queuing, unfortunately. I'd love to pick this back up and try to solve this issue. @ing my colleague @high-analyst-91624 on this as well since he'll be working more on this part of our project.
h
@ambitious-australia-27749 no problem! And of course, let's get this resolved. Do you have a reproducible workflow that this happens on? Re-reading through the messages here I'm wondering if it makes sense to work from the beginning on this. If it's reproducible, we can execute and walk through step by step.
a
@hallowed-mouse-14616 Thanks so much! We really appreciate your help. Let me make a small reproducible workflow and get back to you this week. Our current code & data can't be shared directly, unfortunately.
h
Sounds great. Looking forward to it!