# ask-the-community
a
Hi, we are using the Databricks built-in plugin in our shop and there seems to be an issue with the applications_path parameter passed in the Databricks workflow config. We are trying to override the entrypoint file location using applications_path, but the overridden value is not being picked up and always defaults to the value in the plugin config. It looks like we can no longer override the entrypoint via config from the workflow; it is currently set to whatever is configured server-side.
k
Would need more info. This is not using the agent yet, I assume?
a
No this is not using the agent
this is using the databricks built-in plugin
k
You can still override the application path. Here is an example:
```python
from flytekit import Resources, task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        # this configuration is applied to the spark cluster
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
            "spark.executor.cores": "1",
            "spark.executor.instances": "2",
            "spark.driver.cores": "1",
        },
        executor_path="/usr/bin/python3",
        applications_path="local:///usr/local/bin/entrypoint.py",
    ),
    limits=Resources(mem="2000M"),
    cache_version="1",
    container_image=spark_image,  # spark_image is defined elsewhere
)
def hello_spark(partitions: int) -> float:
    ...
```
a
@Kevin Su we are trying to do the override in the Databricks workflow config because we want to run our tasks on our Databricks instances:
```python
from flytekit import Resources, task
from flytekitplugins.spark import Databricks

@task(
    task_config=Databricks(
        # this configuration is applied to the spark cluster
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
            "spark.executor.cores": "1",
            "spark.executor.instances": "2",
            "spark.driver.cores": "1",
        },
        databricks_conf={
            ...
        },
        applications_path="dbfs:///Filestore/tables/entrypoint.py",
    ),
    limits=Resources(mem="2000M"),
    cache_version="1",
)
def hello_spark(partitions: int) -> float:
    ...
```
The entrypoint override through applications_path is not being picked up and is always using the location specified in the plugin config (values.yaml)
We have tried it multiple times and, trust me, the applications_path provided in the Databricks workflow config is simply being ignored.
k
For the backend plugin, instead of setting applications_path in the Databricks task config, you need to set the path in the propeller config:
```yaml
plugins:
  databricks:
    entrypointFile: dbfs:///FileStore/tables/entrypoint.py
    databricksInstance: <DATABRICKS_ACCOUNT>.cloud.databricks.com
  k8s:
    default-env-vars:
      - FLYTE_AWS_ACCESS_KEY_ID: <AWS_ACCESS_KEY_ID>
      - FLYTE_AWS_SECRET_ACCESS_KEY: <AWS_SECRET_ACCESS_KEY>
      - AWS_DEFAULT_REGION: <AWS_REGION>
```
a
Yeah @Kevin Su, understood, but there is an issue using the recommended entrypoint script across two different Databricks runtimes. The entrypoint script provided in the Flyte documentation works with the latest Databricks runtimes (version 11 and above), but it throws an error on older runtimes like 10.4, so for 10.4 we had to switch to the older version of your entrypoint script. To switch between different versions of the entrypoint script, we wanted to use the applications_path parameter so we can override the entrypoint script path; a sketch of what we have in mind follows below.
I haven't looked too deeply into the entrypoint script, so I'm not sure why it behaves differently between Databricks runtimes.
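For context, this is roughly the kind of selection we were hoping to do if the per-task override worked. It is only a hypothetical sketch: the DBFS paths, the ENTRYPOINTS map, and the databricks_config helper are made up for illustration; only the Databricks task config and applications_path come from the example above.
```python
# Hypothetical sketch: pick the entrypoint script to match the target
# Databricks runtime and pass it via applications_path.
from flytekit import Resources, task
from flytekitplugins.spark import Databricks

# Made-up DBFS locations for the two versions of the entrypoint script.
ENTRYPOINTS = {
    "10.4": "dbfs:///FileStore/tables/entrypoint_legacy.py",
    "12.2": "dbfs:///FileStore/tables/entrypoint.py",
}

def databricks_config(runtime: str) -> Databricks:
    # Build a task config whose applications_path matches the runtime.
    return Databricks(
        spark_conf={"spark.driver.memory": "1000M"},
        databricks_conf={},
        applications_path=ENTRYPOINTS[runtime],
    )

@task(task_config=databricks_config("10.4"), limits=Resources(mem="2000M"))
def hello_spark(partitions: int) -> float:
    ...
```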
k
This is the entrypoint file: https://github.com/flyteorg/flytetools/commit/aff8a9f2adbf5deda81d36d59a0b8fa3b1fc3679. It allows Flyte to use a different command (pyflyte-execute) to run a Spark job on the Databricks platform.
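Conceptually, it is a small shim: Databricks launches the script, and the script hands the task command off to pyflyte-execute. A minimal sketch of that idea only, assuming simple argument forwarding (this is not the actual contents of the linked file):
```python
# Sketch of the entrypoint idea: forward whatever arguments Databricks
# passes to this script straight to pyflyte-execute.
import subprocess
import sys

def main() -> None:
    # Databricks passes the serialized Flyte task command as arguments;
    # hand them off to pyflyte-execute unchanged.
    subprocess.run(["pyflyte-execute", *sys.argv[1:]], check=True)

if __name__ == "__main__":
    main()
```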
What’s the error?
a
Yeah, I think this is the entrypoint file we are using now, and it works fine with Databricks runtime 12.2. But with Databricks runtime 10.4 and this entrypoint script we were getting weird module import errors: modules that were part of the application source code failed to import, and switching to the older entrypoint script (from about 11 months ago) worked with 10.4.
I am not able to share the exact error message here because it's been quite some time, but either way, is it possible to enable the applications_path override parameter through the Databricks config? It could be beneficial in the future, for what it's worth.
k
Are you using pyflyte run?
a
we are doing pyflyte package + flytectl register
we are also doing --fast packaging
k
There are some issues with using fast register with Spark tasks.
Have you tried a non-fast-registered Spark task?
a
We can try that
For fast packaging we are providing the destination dir (/databricks/driver) so Flyte knows where to inject the source code at run time, but I don't think that is needed for non-fast; we just have to make sure the application source code is copied to the /databricks/driver directory in the image. Right?
s
yes, that should work.