# flyte-support
q
Hey there, I'm having regular crashes when using map_task in Flyte. I have a map_task that I use to train multiple LightGBM models for hyperparameter tuning. Most of the jobs work, but occasionally a job randomly fails with exit code 241:
[7]: [1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.
[atkkprsz99kkln8f4rqp-fpahmm6y-0] terminated with exit code (241). Reason [Error]. Message: 
369Z","level":"info","message":"Model fitting input shape: Samples 364252 x Features 189"}
It seems flaky, as rerunning the failed job with the same inputs will often work. Do you know what might be causing this? Thanks
c
Do you happen to have metrics from your K8s cluster? This might be happening in situations of resource contention.
q
What metrics should I look out for? I also forgot to mention that in this cluster I'm using AWS Spot nodes (with fallback to on-demand nodes using Karpenter). In another cluster, we use on-demand nodes and no crashes happen. Could that be related?