# ask-the-community
g
Hi @Kevin Su, one of my colleagues has been testing the built-in databricks plugin in the propeller and found another issue that looks like a bug. When providing a new_cluster stanza for the plugin to spin up a new job cluster for the job, he gets a PERMISSION_DENIED error if he does not specify a policy_id, which is expected (no issue here). However, if he does specify "policy_id": "some policy" in the new_cluster definition, the plugin does not even try to send a request to Databricks but just fails. The only error message in the propeller logs that indicates the problem is this:
"msg": "Some downstream node has failed. Failed: [true]. TimedOut: [false]. Error: [code:\"\\306\\220\" kind:USER ]"
I compared both runs: in both cases a FlyteWorkflow is created in the correct flyte namespace, and the only difference is that the one that fails has a "policy_id" field. My flyte version is 1.10.5. I also tried compiling the propeller with your debug log statements to check whether any error is coming from Databricks, but it looks like we never make the API call to Databricks and instead fail beforehand.
l
I think it is due to another error. https://github.com/flyteorg/flytekit/pull/1951
in this example, you don't need policy id.
g
@L godlike can you elaborate please?
I am using the built-in databricks webapi plugin in the propeller
@Georgi Ivanov Would you like to take a look at this doc?
And list those steps you are not familiar with
I will try to help you.
g
I have set up the built-in plugin and it works fine; however, a user of flyte had problems submitting a workflow to Databricks. When the workflow is registered and run from Flyte (via the UI), the workflow fails and no API call is made to Databricks.
l
Can you give me an example? Or try to elaborate more?
Kevin and I will try to help you and enhance the feature
g
The only error I was able to find is the above-mentioned one. Yet if the user updates the workflow and removes "policy_id", the workflow is successfully sent to the Databricks API
l
Copy code
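# enable Databricks Container Services (custom containers) on the workspace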
curl -X PATCH -n \
-H "Authorization: Bearer <your-personal-access-token>" \
https://<databricks-instance>/api/2.0/workspace-conf \
-d '{
    "enableDcs": "true"
    }'
Have you tried this before?
g
I updated the propeller code with some debug statements that Kevin used to print the request/response from Databricks, but it looks like with the problem workflow we never call Databricks.
1 sec
l
This is a necessary step
I've updated it in the new documentation to make it more noticeable
Please tell me the result, i.e. whether you can use it both with and without the policy id
Thank you very much
g
we can start DBX clusters with custom containers
l
yes, does this solve the problem with policy id?
g
no
it is rather weird to be honest
I inspected the FlyteWorkflow CRDs
and they look ok to me, I mean flyte can create the CRDs for each submitted workflow
but somehow when the DBX payload has policy_id in it, flyte does not submit the job and just throws the above-mentioned error
on another note
the same user has problems using pyflyte (the user has used flyte before so he is experienced in it)
l
Thank you very much
g
Copy code
Failed with Exception Code: SYSTEM:Unknown
Underlying Exception: [Errno 98] Address already in use
53593
l
can you provide a python example for us to debug?
Kevin and I will help this week
g
our flyte is configured with Okta for OIDC and the built-in OAuth server
kk
1 sec
Copy code
import datetime
import random
from operator import add
import flytekit
from flytekit import Resources, task, workflow
from flytekitplugins.spark.task import Databricks

@task(
    task_config=Databricks(
       databricks_conf={
         "run_name" : "flytekit databricks plugin example",
         "timeout_seconds" : 3600,
         "new_cluster" : {
            "num_workers": 2,
            "spark_version": "12.2.x-scala2.12",
            "spark_conf": {
                "spark.hadoop.fs.s3a.server-side-encryption-algorithm": "AES256",
                "spark.driver.extraJavaOptions": "-Dlog4j2.formatMsgNoLookups=true",
                "spark.executor.extraJavaOptions": "-Dlog4j2.formatMsgNoLookups=true"
            },
            "aws_attributes": {
                "first_on_demand": 1,
                "availability": "SPOT_WITH_FALLBACK",
                "zone_id": "auto",
                "instance_profile_arn": "arn:aws:iam::<account>:instance-profile/<profile>",
                "spot_bid_price_percent": 100,
                "ebs_volume_count": 0
            },
            "policy_id": "<policy_id>",
            "node_type_id": "m5d.large",
            "ssh_public_keys": [],
            "custom_tags": {
            ...
            },
            "cluster_log_conf": {
                "s3": {
                    "destination": "s3://<bucket>/cluster-logs",
                    "region": "us-east-1",
                    "enable_encryption": "true",
                    "canned_acl": "bucket-owner-full-control"
                }
            },
            "spark_env_vars": {
                "LIMITED_INTERNET_ACCESS": "false"
            },
            "enable_elastic_disk": "true",
            "init_scripts": []
         }
       },
       databricks_instance="<instance>",
       databricks_token="<token>",
       applications_path="s3://<bucket>/entrypoint.py"
    ),
    limits=Resources(mem="2000M"),
    cache_version="1",
)
def print_spark_config():
    spark = flytekit.current_context().spark_session
    print(spark.sparkContext.getConf().getAll())

@workflow
def my_db_job():
    print_spark_config()
l
This will be really helpful
Thank you very much again.
g
so with this workflow, flyte returns the above error. I don’t see any communication back and forth to Databricks
if we remove the policy_id, the job is sent to databricks
this does not make any sense to me
in both cases the FlyteWorkflow CRD is created
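to illustrate (a trimmed-down, hypothetical comparison, not his actual config), the two runs differ only in that one key
Copy code
# hypothetical minimal comparison, not the real cluster config
working_new_cluster = {
    "num_workers": 2,
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "m5d.large",
}

failing_new_cluster = {
    **working_new_cluster,
    "policy_id": "<policy_id>",  # with this key present, propeller errors out before calling Databricks
}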
l
We will spend some time to deep dive into it, thanks a lot
g
thanks
and this is how the user runs it
Copy code
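# package the workflow for fast registration, then register the resulting archive with flytectl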
pyflyte -k db_plugin package --fast -d "/databricks/driver" --image <custom_databricks_runner_image> --force --output=db_plugin.tgz    
flytectl register files --project flytesnacks --domain development --archive db_plugin.tgz --version v21 --destinationDirectory "/databricks/driver"
there is a reason why he is not running it directly with “pyflyte run”
1. he wants to run it like a production workflow (following the recommended pattern)
l
you can use pyflyte register instead
g
2. second, he is not actually able to use "pyflyte run", because pyflyte receives two callback responses (from the built-in OAuth server), tries to bind to the same localhost port twice, and fails
Copy code
Failed with Exception Code: SYSTEM:Unknown
Underlying Exception: [Errno 98] Address already in use
53593
again, I used to face the same problem, but it disappeared for me (I did not do anything)
I don’t know why this happens to him, but he found out that by using these two commands he is able to register the workflow and then run it from the console
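for context, this is the failure mode I mean (a minimal, hypothetical sketch of binding the same port twice, not pyflyte's actual callback code):
Copy code
import socket

# hypothetical sketch: binding the same localhost port twice raises
# "[Errno 98] Address already in use" on Linux, which is what pyflyte's
# OAuth callback handling appears to hit when it receives two callbacks
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 53593))
first.listen()

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
second.bind(("127.0.0.1", 53593))  # raises OSError: [Errno 98] Address already in use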
k
@Georgi Ivanov I found the bug, will create a PR today. Will ping you once I open it
g
that’s about the double auth?
@L godlike
l
I don’t know the bug yet, will wait for Kevin’s update and double check with him
Thanks for your patience
k
Not double auth. About policy ID
a
Hi Kevin Su and L godlike, I am the colleague that Georgi was referring to in his message above. It would be great if you could help check the double auth issue with pyflyte register and pyflyte run from your side as well
k
@Georgi Ivanov could you give it a try?
I built an image for the propeller.
pingsutw/debug-dbx:v2
There was a panic in the dbx plugin, so you didn’t see any error message.