Hi Team, While upgrading from `flytekit 1.11.0` t...
# flyte-support
Hi Team, While upgrading from `flytekit 1.11.0` to `flytekit 1.14.7`, we encountered an issue with Databricks job runs triggered by Flyte, let me share the details in 🧵
SYMPTOM
In Spark fast registration workflows, `entrypoint.py` runs into an infinite loop while preparing to run the job.
FINDINGS
We have localized the issue. It is related to this change: https://github.com/flyteorg/flytekit/pull/2765
This line in particular: https://github.com/flyteorg/flytekit/blob/6b166772ab7c1339b6a6dde502b47cae31696d76/plugins/flytekit-spark/flytekitplugins/spark/task.py#L207
With this change, flytekit zips everything in the current working directory and places the archive file (`flyte_wf.zip`) in the same directory that is being archived. `shutil.make_archive` creates the zip file and archives the whole directory, including the zip file itself. On Databricks clusters this triggers a recursion and ends in an endless zipping loop.
RECOMMENDATION
The easiest way to fix this issue is to create the archive outside of the directory being archived, like this:
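import os
import shutil
import tempfile

# mkdtemp() keeps the directory around; TemporaryDirectory().name is removed once the object is collected
base_dir = tempfile.mkdtemp()
file_name = "flyte_wf"
file_format = "zip"

shutil.make_archive(os.path.join(base_dir, file_name), file_format, os.getcwd())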
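To sanity-check this approach locally, here is a minimal sketch that wraps the same call in a function and inspects the result. `archive_outside_source` and `archive_members` are hypothetical helper names for illustration, not flytekit code, and the sketch assumes the source directory is not the system temp directory itself (otherwise the output directory would again land inside it).

```python
import os
import shutil
import tempfile
import zipfile


def archive_outside_source(src_dir: str, name: str = "flyte_wf") -> str:
    """Zip src_dir into a freshly created temp directory and return the archive path."""
    # The output directory is created outside src_dir, so the archive
    # cannot be picked up while src_dir is being walked.
    out_dir = tempfile.mkdtemp(prefix="flyte_archive_")
    return shutil.make_archive(os.path.join(out_dir, name), "zip", src_dir)


def archive_members(archive_path: str) -> list[str]:
    """List the entry names stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.namelist()
```

Listing the members of the returned archive is a cheap way to confirm that `flyte_wf.zip` does not show up inside itself.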
The fix above might work, but I see another (cosmetic) problem here. `entrypoint.py` downloads the additional distribution from S3 (in `.tar.gz` format) into the working directory and extracts it, then `shutil.make_archive` takes the whole working directory and creates a ZIP archive. This ZIP archive also contains the original `.tar.gz` file. This probably won't cause any problems; it's just not an elegant solution.
As I understand it, the working directory has to be archived because `sparkContext.addPyFile` only accepts .py / .zip dependencies. Have you considered using `sparkContext.addArchive` instead, which supports .zip, .tar, .tar.gz, .tgz and .jar dependencies? I understand that this feature has only been available since Spark 3.3.0, but that does not seem like a major limitation since it has been out for almost 3 years.
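To make the `addArchive` idea concrete, here is a rough sketch of how the choice between the two calls could look. `add_distribution` is a hypothetical helper, not something that exists in flytekit, and `sc` is assumed to be a `pyspark.SparkContext` (the sketch only calls its `addPyFile` and `addArchive` methods). One caveat I have not verified in this setup: as far as I know, `addArchive` unpacks the archive into the working directory on each node and does not put it on the Python path the way `addPyFile` does, so it may not be a drop-in replacement.

```python
from typing import Any

TAR_SUFFIXES = (".tar", ".tar.gz", ".tgz")
PYFILE_SUFFIXES = (".py", ".zip")


def add_distribution(sc: Any, path: str) -> str:
    """Register a dependency with Spark and return the name of the method used.

    Hypothetical helper: tarballs go through addArchive (Spark >= 3.3.0),
    .py / .zip files go through addPyFile.
    """
    lower = path.lower()
    if lower.endswith(TAR_SUFFIXES):
        if not hasattr(sc, "addArchive"):
            raise RuntimeError("SparkContext.addArchive requires Spark 3.3.0 or newer")
        sc.addArchive(path)
        return "addArchive"
    if lower.endswith(PYFILE_SUFFIXES):
        sc.addPyFile(path)
        return "addPyFile"
    raise ValueError(f"unsupported dependency format: {path}")
```

With something like this, the downloaded `.tar.gz` could be handed to Spark directly instead of being re-zipped together with its extracted contents.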
@glamorous-carpet-83516 Please let me know your thoughts on this.
@glamorous-carpet-83516 Have you had a chance to take a look at the above? ⬆️
Opened an issue for the same: https://github.com/flyteorg/flyte/issues/6405