# flyte-support
l
Hi everyone, I'm encountering an issue with Flyte and was hoping to get some help. When I run a task that receives a large dictionary (around 300 KB), the task gets stuck in the "running" state even though my logic has completed. The pod running the task also remains in the "running" state. Has anyone else experienced something similar, or does anyone have any insight into what might be causing this? Thank you!
t
300 KB isn’t that big… it should be fine. we typically don’t see issues until around 1 MB. could you check to see if the task pod is still running?
also are there any logs in the flyte/propeller container?
are you running the flyte-core helm chart or the flyte-binary helm chart?
l
Hey! So the task pod is still running. I am running the flyte-binary chart.
The only error log I see in Flyte is: "Failed to Cache the metadata, caused by: The entry size is larger than 1/1024 of cache size"
I also see the warning log: "Failed to cast contentMD5 [] to string"
It happens only when I send big dictionaries (around 300 KB)
Are there specific logs I should be looking for?
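For scale: going only by the wording of that error, an entry has to fit within 1/1024 of the cache. A rough sketch of that arithmetic, assuming the limit is exactly cache size divided by 1024 and that the cached entry is about as big as the 300 KB input (neither is confirmed here, and the cache sizes below are made up for illustration):

```python
def max_entry_bytes(cache_size_bytes: int) -> int:
    """Largest entry allowed if the limit is 1/1024 of the cache (assumption)."""
    return cache_size_bytes // 1024


ENTRY_BYTES = 300 * 1024  # roughly the 300 KB dictionary from this thread

# Hypothetical cache sizes in MiB; the real value for this deployment is unknown.
for cache_mib in (100, 256, 512):
    limit = max_entry_bytes(cache_mib * 1024 * 1024)
    fits = ENTRY_BYTES <= limit
    print(f"cache={cache_mib} MiB  limit={limit} bytes  300 KB entry fits: {fits}")
```

Under these assumptions a 300 KB entry would only fit once the cache is larger than about 300 MiB.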
t
apologies but could you clarify what’s happening some more please? when you say “my logic has completed”, you mean the logic of the task that is receiving the 300kb input right? and how do you know the logic has completed? through logging?
l
Yes, I mean the logic that receives the input and I know it has completed through logging
t
what are the outputs of this task?
l
The output is a pydantic model, but I've tried many outputs, including None, and I get the same result
additionally, my task is memory-intensive and uses a significant amount of RAM during execution
t
and you can confirm you’re receiving/using this 300kb map correctly?
what’s in the map? is it just primitives? or are there offloaded data types (like files/dataframes)?
l
It is a dictionary of strings
What do you mean by correctly?
t
like it’s not garbled, not incomplete, etc. it’s the correct length.
can you run with FLYTE_SDK_LOGGING_LEVEL=10? that will at least increase the verbosity on the flytekit side so we can see at what step it’s getting stuck.
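For context, 10 is the numeric value of Python's standard DEBUG level, which is presumably why this setting makes the output more verbose. A quick stdlib-only check (nothing Flyte-specific; how flytekit reads the variable is not shown here):

```python
import logging

level = 10  # the value suggested for FLYTE_SDK_LOGGING_LEVEL
level_name = logging.getLevelName(level)

# Print the standard-library name for this numeric level.
print(level_name)
```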
l
The input is correct
I will try logging flag. Thank you!
t
basically after input is read it’s really not used at all except in user code. except for taking up memory (and 300kb isn’t that much even if you assume it takes 30mb to represent 300kb for some crazy reason), it should have no impact on the rest of the task code.
it definitely shouldn’t cause things to hang.
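A rough way to sanity-check the memory side, using only the standard library. The dictionary shape here is an assumption (3,000 string entries of 100 characters each, about 300 KB of text), not the real input, and sys.getsizeof only gives an approximate per-object size:

```python
import sys


def approx_dict_bytes(d: dict) -> int:
    """Approximate in-memory size: the dict itself plus each key and value object."""
    total = sys.getsizeof(d)
    for key, value in d.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total


# Hypothetical input: 3,000 distinct 100-character string values.
data = {f"key-{i}": f"{i:0100d}" for i in range(3000)}
raw_chars = sum(len(v) for v in data.values())
size = approx_dict_bytes(data)

print(f"raw text: {raw_chars / 1000:.0f} KB, approx in memory: {size / 1024:.0f} KiB")
```

The in-memory figure comes out larger than the raw text because of per-object overhead, but it stays far below the 30 MB worst case mentioned above.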
h
We experienced this today as well: everything looked like it was going smoothly, except that one of our dynamic tasks hung in the running state even though its subtasks succeeded. Propeller logs had "Failed to cast contentMD5 [] to string". After aborting manually, the UI showed that some tasks failed because they were out of disk. I don't know for sure whether the hanging dynamic task was one of them; I was pretty sure every task had either succeeded, not started, or was the one hanging in running with all subtasks succeeded. Will pay more attention if it happens again
l
I ran my task with FLYTE_SDK_LOGGING_LEVEL=10 and there is no log that shows which step is getting stuck. Actually there are no flytekit logs at all after my logic finishes, including on successful tasks with small input 😞
t
can you copy paste the entire log (redact whatever you need)
l
{"asctime": "2024-08-10 19:55:11,894", "name": "flytekit", "levelname": "INFO", "message": "Execute user level code. [Time: 17.681243s]", "taskName": null}
{"asctime": "2024-08-10 19:55:11,895", "name": "flytekit", "levelname": "DEBUG", "message": "Invalid base type typing.Union in call to isinstance", "exc_info": "Traceback (most recent call last):\n  File \"/root/micromamba/envs/dev/lib/python3.12/site-packages/flytekit/core/type_engine.py\", line 1039, in get_transformer\n    if isinstance(python_type, origin_type) or (  # type: ignore[arg-type]\n       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/root/micromamba/envs/dev/lib/python3.12/typing.py\", line 510, in __instancecheck__\n    raise TypeError(f\"{self} cannot be used with isinstance()\")\nTypeError: typing.Union cannot be used with isinstance()", "taskName": null}
{"asctime": "2024-08-10 19:55:11,896", "name": "flytekit", "levelname": "DEBUG", "message": "Invalid base type <function NamedTuple at 0x7fffff149da0> in call to isinstance", "exc_info": "Traceback (most recent call last):\n  File \"/root/micromamba/envs/dev/lib/python3.12/site-packages/flytekit/core/type_engine.py\", line 1039, in get_transformer\n    if isinstance(python_type, origin_type) or (  # type: ignore[arg-type]\n       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: isinstance() arg 2 must be a type, a tuple of types, or a union", "taskName": null}
{"asctime": "2024-08-10 19:55:11,900", "name": "flytekit", "levelname": "DEBUG", "message": "Detected file /tmp/flytefvwe0k7e/local_flytekit/ce5f7f90281303e923fad55e58834000, call put non-recursive", "taskName": null}
{"asctime": "2024-08-10 19:55:12,061", "name": "flytekit", "levelname": "INFO", "message": "Translate the output to literals. [Time: 0.166509s]", "taskName": null}
{"asctime": "2024-08-10 19:55:12,062", "name": "flytekit", "levelname": "DEBUG", "message": "Adding trailing sep to", "taskName": null}
{"asctime": "2024-08-10 19:55:12,070", "name": "flytekit", "levelname": "INFO", "message": "Upload data to <s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f1834b28fef004897b22/n0/data/0/dn3/0>. [Time: 0.008056s]", "taskName": null}
{"asctime": "2024-08-10 19:55:12,070", "name": "flytekit", "levelname": "INFO", "message": "Engine folder written successfully to the output prefix <s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f1834b28fef004897b22/n0/data/0/dn3/0>", "taskName": null}
{"asctime": "2024-08-10 19:55:12,070", "name": "flytekit", "levelname": "DEBUG", "message": "Finished _dispatch_execute", "taskName": null}
Here are the logs. Thank you!
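One way to see where the time goes in logs like these is to parse each JSON line and print the gap since the previous one. A minimal sketch, assuming every line is a JSON object with an asctime field in the format shown above; the helper name gaps is made up for illustration, and the sample is three of the lines above:

```python
import json
from datetime import datetime


def gaps(lines):
    """Return (seconds since previous line, levelname, message) for each log line."""
    result = []
    previous = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        stamp = datetime.strptime(record["asctime"], "%Y-%m-%d %H:%M:%S,%f")
        delta = 0.0 if previous is None else (stamp - previous).total_seconds()
        result.append((delta, record["levelname"], record["message"]))
        previous = stamp
    return result


sample = [
    '{"asctime": "2024-08-10 19:55:11,894", "name": "flytekit", "levelname": "INFO", "message": "Execute user level code. [Time: 17.681243s]", "taskName": null}',
    '{"asctime": "2024-08-10 19:55:12,061", "name": "flytekit", "levelname": "INFO", "message": "Translate the output to literals. [Time: 0.166509s]", "taskName": null}',
    '{"asctime": "2024-08-10 19:55:12,070", "name": "flytekit", "levelname": "DEBUG", "message": "Finished _dispatch_execute", "taskName": null}',
]

for delta, level, message in gaps(sample):
    print(f"+{delta:.3f}s {level:5} {message}")
```

A large gap before a line, or a log that simply stops after "Finished _dispatch_execute", narrows down where to look next.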
t
can you ls the contents of s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f1834b28fef004897b22/n0/data/0/dn3/0 please? Also are these logs roughly the same as the smaller case that works?
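For listing that location, the URI splits into a bucket and a key prefix. A small stdlib-only sketch; the helper name split_s3_uri is hypothetical, and the listing itself would need an S3 client or CLI with access to the bucket, which is not shown:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri(
    "s3://my-s3-bucket/metadata/propeller/"
    "flytesnacks-development-f1834b28fef004897b22/n0/data/0/dn3/0"
)
print(bucket)
print(prefix)
```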