Hi folks ! I'm just getting started with Flyte and...
# ask-the-community
f
Hi folks! I'm just getting started with Flyte and followed the https://github.com/davidmirror-ops/flyte-the-hard-way docs to set up a simple single-cluster deployment on top of EKS. However, when I try to run a workflow (the wine dataset example), I hit the following issues:
1. Task stuck in Running/Queued state: After I launched the workflow, the first task, get_data, seemed stuck in Running. I noticed that each node in the node group had 2 CPUs, so I updated the nodes to 4 CPUs each. This, I think, got me out of the queued state, but then the task started failing, which leads to the next issue.
2. Task failing with a "died with <Signals.SIGKILL: 9>" error: The log for the failed task is below. I searched some Slack threads and someone mentioned that this error might be caused by an OOM kill, but I'm not sure that's the case here, as each node has 16GB of memory. Isn't that sufficient?
[1/1] currentAttempt done. Last Error: USER::                                                     │
│ ❱  760 │   │   │   │   return __callback(*args, **kwargs)                    │
│                                                                              │
│ /usr/local/lib/python3.10/site-packages/flytekit/bin/entrypoint.py:508 in    │
│ fast_execute_task_cmd                                                        │
│                                                                              │
│ ❱ 508 │   subprocess.run(cmd, check=True)                                    │
│                                                                              │
│ /usr/local/lib/python3.10/subprocess.py:526 in run                           │
│                                                                              │
│ ❱  526 │   │   │   raise CalledProcessError(retcode, process.args,           │
╰──────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['pyflyte-execute', '--inputs', 
'<s3://flyte-cluster-bucket-2023/metadata/propeller/flytesnacks-development-ff067>
d646b0684b76a94/n0/data/inputs.pb', '--output-prefix', 
'<s3://flyte-cluster-bucket-2023/metadata/propeller/flytesnacks-development-ff067>
d646b0684b76a94/n0/data/0', '--raw-output-data-prefix', 
'<s3://flyte-cluster-bucket-2023/data/2b/ff067d646b0684b76a94-n0-0>', 
'--checkpoint-path', 
'<s3://flyte-cluster-bucket-2023/data/2b/ff067d646b0684b76a94-n0-0/_flytecheckpoi>
nts', '--prev-checkpoint', '""', '--dynamic-addl-distro', 
'<s3://flyte-cluster-bucket-2023/flytesnacks/development/4MOWXYYMZXUPWCJJKGSQ6EOI>
24======/script_mode.tar.gz', '--dynamic-dest-dir', '/root', '--resolver', 
'flytekit.core.python_auto_container.default_task_resolver', '--', 
'task-module', 'example', 'task-name', 'get_data']' died with <Signals.SIGKILL: 
9>.
Can someone please help me out?
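For anyone reading the traceback above: the entrypoint calls subprocess.run(cmd, check=True), and Python reports a child that was terminated by a signal as a CalledProcessError with a negative return code. The sketch below reproduces that behaviour with the standard library only. The helper name run_and_report and the self-killing child process are made up for illustration; the child merely stands in for a container that the kernel kills (for example, when it exceeds its memory limit), and this assumes a POSIX system.

```python
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd with check=True and describe how the child process ended."""
    try:
        subprocess.run(cmd, check=True)
    except subprocess.CalledProcessError as err:
        if err.returncode < 0:
            # A negative return code means the child was terminated by a signal.
            sig = signal.Signals(-err.returncode)
            return f"killed by {sig.name}"
        return f"exited with code {err.returncode}"
    return "ok"


# Hypothetical child that sends SIGKILL to itself, standing in for a process
# that is killed from outside (e.g. by the OOM killer).
child = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
]

print(run_and_report(child))
```

The signal alone does not say who sent it, so checking the pod's status and its resource requests and limits is still needed to confirm whether memory was the cause.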
d
Hi @Faisal Anees, welcome to the Flyte community! I'm glad you found the guide useful for bootstrapping a cluster. Can we check the task pod's status and resources? Maybe run
kubectl get po -n flytesnacks-development
and then
kubectl describe po <your-task-pod-name> -n flytesnacks-development
s
@Faisal Anees, are you requesting 16GB mem?
f
Update: @Samhita Alla I wasn't actually requesting any resources on the task. I added this resource request to the task and it ran successfully! 😄
@task(requests=Resources(cpu="1", mem="500Mi"), limits=Resources(cpu="2", mem="800Mi"))
Also, thanks @David Espejo (he/him) for the amazing guide! I think I spent the better part of the last 2 days going over the Flyte docs (sometimes older versions, just to get some hints) to set up a cluster, but your guide finally got Flyte running for me 🙏