aloof-painting-18735
04/02/2025, 12:14 PM
After upgrading from flytekit 1.11.0 to flytekit 1.14.7, we encountered an issue with Databricks job runs triggered by Flyte. Let me share the details in 🧵
aloof-painting-18735
04/02/2025, 12:17 PM
SYMPTOM
In Spark fast registration workflows, entrypoint.py runs into an infinite loop while preparing to run the job.
FINDINGS
We have localized the issue and it's related to this change:
https://github.com/flyteorg/flytekit/pull/2765
This line in particular:
https://github.com/flyteorg/flytekit/blob/6b166772ab7c1339b6a6dde502b47cae31696d76/plugins/flytekit-spark/flytekitplugins/spark/task.py#L207
With this change, flytekit zips everything in the current working directory, and the archive file (flyte_wf.zip) is placed in the same directory that is being archived.
aloof-painting-18735
04/02/2025, 12:18 PM
shutil.make_archive creates a zip file and archives the whole directory, including the zip file itself. On Databricks clusters, this triggers a recursion and runs into an endless zipping loop.
aloof-painting-18735
04/02/2025, 12:18 PM
RECOMMENDATION
The easiest way to fix this issue is to create the archive outside of the directory being archived, like this:
aloof-painting-18735
04/02/2025, 12:18 PM
import os
import shutil
import tempfile

base_dir = tempfile.mkdtemp()  # persists after creation, outside the directory being archived
file_name = "flyte_wf"
file_format = "zip"
shutil.make_archive(os.path.join(base_dir, file_name), file_format, os.getcwd())
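To make the recommendation easier to verify, here is a minimal sketch that wraps it in a function and returns the archive path. The function name `archive_outside` and its parameters are hypothetical and not part of flytekit; the only assumption is that the archive may live in any temporary directory outside the source tree.

```python
import os
import shutil
import tempfile


def archive_outside(source_dir: str, file_name: str = "flyte_wf") -> str:
    """Zip source_dir into a fresh temporary directory and return the archive path.

    Hypothetical helper for illustration; not flytekit's implementation.
    """
    # mkdtemp() creates a directory that stays on disk until the caller removes it.
    out_dir = tempfile.mkdtemp()
    # base_name carries no extension; make_archive appends ".zip" itself.
    base_name = os.path.join(out_dir, file_name)
    # root_dir=source_dir makes the entries in the archive relative to source_dir.
    return shutil.make_archive(base_name, "zip", root_dir=source_dir)
```

Calling `archive_outside(os.getcwd())` would produce the ZIP in a separate temporary directory, so the directory walk never encounters the file that is being written. The caller is responsible for deleting the temporary directory once the archive has been handed to Spark.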
aloof-painting-18735
04/02/2025, 12:19 PM
entrypoint.py downloads the additional distribution from S3 (in .tar.gz format) into the working directory and extracts it, then shutil.make_archive takes the whole working directory and creates a ZIP archive. This ZIP archive also contains the original .tar.gz file. This probably won't cause any problems, but it's not an elegant solution.
As I understand it, archiving the working directory is necessary because sparkContext.addPyFile only accepts .py / .zip dependencies.
Have you considered using sparkContext.addArchive instead, which supports .zip, .tar, .tar.gz, .tgz and .jar dependencies?
I understand that this feature has only been available since Spark 3.3.0, but that does not seem like a major limitation since it has been out for almost 3 years.
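A rough sketch of what the addArchive idea could look like, choosing the Spark call from the file suffix. The helper name `add_dependency` and the suffix tuples are hypothetical; `spark_context` is assumed to be a `pyspark.SparkContext` on Spark 3.3.0 or later. One caveat to check before relying on it: as far as I know, addArchive unpacks the archive on each node but does not add its contents to the Python import path the way addPyFile does, so importing from it may need extra handling.

```python
import os

# Suffix lists follow the formats named in the message above.
PY_FILE_SUFFIXES = (".py", ".zip")
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar", ".jar")


def add_dependency(spark_context, path: str) -> str:
    """Register path with Spark and return the name of the method that was used.

    Hypothetical helper for illustration; not part of flytekit or pyspark.
    """
    name = os.path.basename(path).lower()
    if name.endswith(PY_FILE_SUFFIXES):
        # .py and .zip files are accepted by addPyFile.
        spark_context.addPyFile(path)
        return "addPyFile"
    if name.endswith(ARCHIVE_SUFFIXES):
        # addArchive is only available from Spark 3.3.0 onward.
        spark_context.addArchive(path)
        return "addArchive"
    raise ValueError(f"unsupported dependency format: {path}")
```

With something like this, the downloaded .tar.gz distribution could be passed to Spark directly, without re-zipping the working directory first.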