# ask-the-community
a
Hi team, Spark question here... We're having an issue with some Spark applications not being cleared after completing/failing. This leads to thousands of "active" applications, and the aggregate size of env vars for Spark pods exceeds the ARG_MAX limit (as there are multiple vars for each exec/driver pod), so all jobs start failing. Whose responsibility is it to clear completed/failed applications, the Spark operator or propeller?
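[Editor's note] A quick way to see how close a pod's environment is to the kernel limit mentioned above; this is an illustrative sketch (the variable names and the simple byte count are assumptions, not from the thread):

```shell
# Compare the kernel's ARG_MAX against the aggregate size of the current
# environment. When env vars alone approach ARG_MAX, exec() calls start
# failing with E2BIG, which matches the "all jobs start failing" symptom.
ARG_MAX=$(getconf ARG_MAX)
# Rough total: every "NAME=value" line plus its terminator.
ENV_BYTES=$(env | wc -c)
echo "ARG_MAX=${ARG_MAX} env_bytes=${ENV_BYTES}"
if [ "${ENV_BYTES}" -ge "${ARG_MAX}" ]; then
  echo "environment alone exceeds ARG_MAX"
fi
```

Running this inside an affected driver/executor pod (e.g. via `kubectl exec`) shows how much headroom is left.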
in case it helps:
```
fieldsV1:
      f:status:
        .:
        f:applicationState:
          .:
          f:state:
        f:driverInfo:
          .:
          f:podName:
          f:webUIAddress:
          f:webUIPort:
          f:webUIServiceName:
        f:executionAttempts:
        f:executorState:
          .:
          f:a7hhmd24d6hgw776vkfv-n0-0-exec-1:
        f:lastSubmissionAttemptTime:
        f:sparkApplicationId:
        f:submissionAttempts:
        f:submissionID:
        f:terminationTime:
    Manager:    spark-operator
    Operation:  Update
    Time:       2022-12-01T02:11:05Z
```
this one completed 20hrs ago, but the app is still listed
k
Hmm, the failed application should be cleared by the Spark operator, and Flyte should clear the workflow and everything else after the GC interval
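[Editor's note] If the operator's own cleanup isn't keeping up, the SparkApplication CRD offers an explicit TTL. A minimal sketch, assuming a spark-on-k8s-operator version that honors `timeToLiveSeconds` (the app name and TTL value here are illustrative):

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: example-app        # hypothetical name
spec:
  # Ask the operator to garbage-collect this SparkApplication
  # this many seconds after it reaches a terminal state.
  timeToLiveSeconds: 3600  # delete 1h after completion/failure
```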
a
@Ketan (kumare3) do you know how these env vars get added? I don't see them in the app/pod specs, so I guess they are created at runtime
```
FGXQ2R3ISZJ2FC_N3_0_N5_0_UI_SVC_PORT_4040_TCP_PROTO=tcp
FMFX6ESKONI6FO_N3_0_N4_0_UI_SVC_PORT_4040_TCP_PROTO=tcp
FTIBWPHPXO32XE_N3_0_N2_0_UI_SVC_PORT_4040_TCP_PROTO=tcp
F6K4ORSNOMFTLY_N3_0_N2_0_UI_SVC_PORT_4040_TCP_ADDR=172.20.2.240
FRUWTUSAANBRRA_N3_0_N3_0_UI_SVC_SERVICE_HOST=172.20.171.76
FE3RGZTYAPR14K_N3_0_N2_0_UI_SVC_PORT_4040_TCP=tcp://172.20.86.57:4040
FK5DWJ1IOU6QWC_N3_0_N4_0_UI_SVC_SERVICE_PORT=4040
FGX66OT2O1U4OW_N3_0_N3_0_UI_SVC_PORT_4040_TCP_PROTO=tcp
```
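[Editor's note] Variables with this `*_UI_SVC_SERVICE_HOST` / `*_PORT_4040_TCP_*` shape look like Kubernetes service links: the kubelet injects Docker-link-style env vars for every Service in a pod's namespace, so thousands of lingering Spark UI Services mean thousands of injected vars per new pod. A hedged sketch of the standard mitigation, disabling service links on the pod spec (the pod name and image are hypothetical; in practice this would go in the driver/executor pod template):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver-example  # hypothetical
spec:
  # Stop kubelet from injecting *_SERVICE_HOST / *_PORT_* env vars
  # for every Service in the namespace.
  enableServiceLinks: false
  containers:
    - name: spark-kubernetes-driver
      image: spark:3.3.0      # hypothetical image
```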
For some reason we had 4.5K of both completed and failed applications just hanging there. I had to delete them using kubectl and then restart the Spark operator, flyteadmin, and propeller to clear the state. What's odd is that Spark wouldn't recover until after I restarted admin and propeller. I don't know if both were necessary, though, as I restarted them at the same time
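[Editor's note] The manual kubectl cleanup described above can be scripted; a sketch, assuming `jq` is available and using an illustrative namespace (the terminal state names `COMPLETED`/`FAILED` come from the SparkApplication status shown earlier in the thread):

```
# Bulk-delete SparkApplications that have reached a terminal state.
kubectl get sparkapplications -n flyte -o json \
  | jq -r '.items[]
      | select(.status.applicationState.state == "COMPLETED"
            or .status.applicationState.state == "FAILED")
      | .metadata.name' \
  | xargs -r kubectl delete sparkapplication -n flyte
```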
Seeing the same again. Completed/failed Spark tasks are not cleared. I think this started happening after we upgraded to the latest propeller
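[Editor's note] On the Flyte side, the GC interval mentioned earlier is configurable in flytepropeller. A sketch of the relevant config section; key names should be verified against the deployed propeller version:

```yaml
propeller:
  gc-interval: 30m   # how often completed workflow CRDs are garbage-collected
  max-ttl-hours: 23  # completed workflows older than this are deleted
```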
Same for pods: Completed but not removed
k
Cc @Haytham Abuelfutuh want to look at this?
d
Hi @Alex Pozimenko, wondering if this is still an issue? If so, which version of flytepropeller are you using?
a
@David Espejo (he/him) we updated the Spark config to remove the majority of the env vars. This should resolve the problem (we haven't tested it at scale yet, as we're moving away from Spark)