# ask-the-community
a
Hi team, Spark question here... We're having an issue with some Spark applications not being cleared after completing/failing. This leads to thousands of "active" applications, and the aggregate size of env vars for Spark pods exceeds the ARG_MAX limit (as there are multiple vars for each exec/driver pod), so all jobs start failing. Whose responsibility is it to clear completed/failed applications, the Spark operator or propeller?
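[Editor's note] A quick way to see how close a pod's environment is to the kernel limit mentioned above; this is an illustrative sketch (the variable names and the simple byte count are assumptions, not from the thread):

```shell
# Compare the kernel's ARG_MAX against the aggregate size of the current
# environment. When env vars alone approach ARG_MAX, exec() calls start
# failing with E2BIG, which matches the "all jobs start failing" symptom.
ARG_MAX=$(getconf ARG_MAX)
# Rough total: every "NAME=value" line plus its terminator.
ENV_BYTES=$(env | wc -c)
echo "ARG_MAX=${ARG_MAX} env_bytes=${ENV_BYTES}"
if [ "${ENV_BYTES}" -ge "${ARG_MAX}" ]; then
  echo "environment alone exceeds ARG_MAX"
fi
```

Running this inside an affected driver/executor pod (e.g. via `kubectl exec`) shows how much headroom is left.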
in case it helps:
```
fieldsV1:
      f:status:
        .:
        f:applicationState:
          .:
          f:state:
        f:driverInfo:
          .:
          f:podName:
          f:webUIAddress:
          f:webUIPort:
          f:webUIServiceName:
        f:executionAttempts:
        f:executorState:
          .:
          f:a7hhmd24d6hgw776vkfv-n0-0-exec-1:
        f:lastSubmissionAttemptTime:
        f:sparkApplicationId:
        f:submissionAttempts:
        f:submissionID:
        f:terminationTime:
    Manager:    spark-operator
    Operation:  Update
    Time:       2022-12-01T02:11:05Z
```
this one completed 20hrs ago, but the app is still listed
k
Hmm, the failed application should be cleared by the Spark operator, and Flyte should clear the workflow and everything else after the GC interval
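[Editor's note] If the operator's own cleanup isn't keeping up, the SparkApplication CRD offers an explicit TTL. A minimal sketch, assuming a spark-on-k8s-operator version that honors `timeToLiveSeconds` (the app name and TTL value here are illustrative):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: example-app        # hypothetical name
spec:
  # Ask the operator to garbage-collect this SparkApplication
  # this many seconds after it reaches a terminal state.
  timeToLiveSeconds: 3600  # delete 1h after completion/failure
```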
a
@Ketan (kumare3) do you know how these env vars get added? I don't see them in the app/pod specs, so I guess they are created at runtime
```
FGXQ2R3ISZJ2FC_N3_0_N5_0_UI_SVC_PORT_4040_TCP_PROTO=tcp
FMFX6ESKONI6FO_N3_0_N4_0_UI_SVC_PORT_4040_TCP_PROTO=tcp
FTIBWPHPXO32XE_N3_0_N2_0_UI_SVC_PORT_4040_TCP_PROTO=tcp
F6K4ORSNOMFTLY_N3_0_N2_0_UI_SVC_PORT_4040_TCP_ADDR=172.20.2.240
FRUWTUSAANBRRA_N3_0_N3_0_UI_SVC_SERVICE_HOST=172.20.171.76
FE3RGZTYAPR14K_N3_0_N2_0_UI_SVC_PORT_4040_TCP=tcp://172.20.86.57:4040
FK5DWJ1IOU6QWC_N3_0_N4_0_UI_SVC_SERVICE_PORT=4040
FGX66OT2O1U4OW_N3_0_N3_0_UI_SVC_PORT_4040_TCP_PROTO=tcp
```
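[Editor's note] Variables with this `*_UI_SVC_SERVICE_HOST` / `*_PORT_4040_TCP_*` shape look like Kubernetes service links: the kubelet injects Docker-link-style env vars for every Service in a pod's namespace, so thousands of lingering Spark UI Services mean thousands of injected vars per new pod. A hedged sketch of the standard mitigation, disabling service links on the pod spec (the pod name and image are hypothetical; in practice this would go in the driver/executor pod template):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-example  # hypothetical
spec:
  # Stop kubelet from injecting *_SERVICE_HOST / *_PORT_* env vars
  # for every Service in the namespace.
  enableServiceLinks: false
  containers:
    - name: spark-kubernetes-driver
      image: spark:3.3.0      # hypothetical image
```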
For some reason we had 4.5K of both completed and failed applications just hanging there. I had to delete them using kubectl and then restart the Spark operator, flyteadmin, and propeller to clear the state. What's odd is that Spark wouldn't recover until after I restarted admin and propeller. I don't know if both were necessary, though, as I restarted them at the same time
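[Editor's note] The manual kubectl cleanup described above can be scripted; a sketch, assuming `jq` is available and using an illustrative namespace (the terminal state names `COMPLETED`/`FAILED` come from the SparkApplication status shown earlier in the thread):

```
# Bulk-delete SparkApplications that have reached a terminal state.
kubectl get sparkapplications -n flyte -o json \
  | jq -r '.items[]
      | select(.status.applicationState.state == "COMPLETED"
            or .status.applicationState.state == "FAILED")
      | .metadata.name' \
  | xargs -r kubectl delete sparkapplication -n flyte
```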
Seeing the same again. Completed/failed Spark tasks are not cleared. I think this started happening after we upgraded to the latest propeller
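[Editor's note] On the Flyte side, the GC interval mentioned earlier is configurable in flytepropeller. A sketch of the relevant config section; key names should be verified against the deployed propeller version:

```yaml
propeller:
  gc-interval: 30m   # how often completed workflow CRDs are garbage-collected
  max-ttl-hours: 23  # completed workflows older than this are deleted
```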
Same for pods: Completed but not removed
k
Cc @Haytham Abuelfutuh want to look at this?
d
Hi @Alex Pozimenko, wondering if this is still an issue? If so, which version of flytepropeller are you using?
a
@David Espejo (he/him) we updated the Spark config to remove the majority of the env vars. This should resolve the problem (we haven't tested it at scale yet, as we're moving away from Spark)