# ask-the-community
f
Hello, I have an existing process that runs in a Databricks notebook and reads feature data from a Databricks Catalog table (e.g. bolt_feature_engg_prod.gold.audience_segments_feature_set) using SQL:
```python
input_data_query = f'select ... FROM bolt_feature_engg_prod.gold.audience_segments_feature_set where ...'
input_data = spark.sql(input_data_query)
feature_data = input_data.toPandas()
```
I need to convert this Databricks process to a Flyte workflow, and I would like to re-use the SQL query because it is a very complicated query. Also, the upstream process that produces the feature data cannot be changed, and the data must be stored in the same Databricks Catalog table. How can I do that? Thanks!
c
is your question specifically about interacting with databricks within a flyte Task? if so, you can use the python databricks SDK to run arbitrary queries that you have already defined:
```python
"""
Docs found here: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/python-sql-connector
"""
from databricks import sql

stmt = "SELECT ..."  # the existing (complicated) query can be dropped in here as-is


def run_query(stmt, server_hostname, http_path, access_token, **kwargs):
    # Open a connection to the SQL warehouse / cluster endpoint and execute the statement.
    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token,
                     **kwargs) as connection:
        with connection.cursor() as cursor:
            cursor.execute(stmt)
            return cursor.fetchall()
```
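Not from the thread, but for the Flyte side of the original question, a minimal sketch of how the snippet above could be wrapped in a Flyte task that returns a pandas DataFrame. It assumes flytekit and databricks-sql-connector are installed; the connection parameters are placeholders and would normally come from Flyte secrets rather than plain task inputs.
```python
# Hypothetical sketch: wrapping the connector call in a Flyte task.
# Assumes flytekit and databricks-sql-connector are installed; connection details are
# placeholders and would normally be injected via Flyte secrets.
import pandas as pd
from databricks import sql
from flytekit import task


@task
def fetch_feature_data(server_hostname: str, http_path: str, access_token: str) -> pd.DataFrame:
    # The existing complicated query can be reused here unchanged.
    stmt = "select ... FROM bolt_feature_engg_prod.gold.audience_segments_feature_set where ..."
    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token) as connection:
        with connection.cursor() as cursor:
            cursor.execute(stmt)
            # fetchall_arrow() returns a pyarrow.Table; convert it to pandas for downstream tasks.
            return cursor.fetchall_arrow().to_pandas()
```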
f
Hi @Chris Grass, thank you for the quick response. Do you think it will work in a non-Azure environment? Our Databricks cluster is hosted in AWS.
c
@Frank Shen i assume so - the sdk isn't azure specific (only the doc i linked to). there are several other python SDKs that are offered as well, but some require Unity Catalog to be enabled in your workspace
f
Hi @Chris Grass, I tried it and it works with a small dataset. Thanks! However, I have a large dataset of at least 7 million records. Reading is OK, although slow. But saving data of that size back to a Databricks Unity Catalog table takes way too long and won't finish. How would you solve this issue? And do you know how to connect Spark jobs running in EKS to Databricks? I need to run Spark jobs from external EKS, and Spark should save the large dataset to Databricks. I just don't know the connectivity part.
c
we solve this problem by staging the data in a mount that databricks has access to; we write parquet files that match the destination to an azure blob store container, but this can be done several ways (see https://www.confessionsofadataguy.com/lessons-learned-from-merge-operations-with-billions-of-records-on-databricks-spark/). it's also important to understand how partitions are used when inserting data.
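As a rough illustration of the staging write described above (not code from the thread): the path, partition column, and source view below are assumptions; the point is to write parquet that matches the destination schema, partitioned the same way the destination table is partitioned.
```python
# Rough sketch of the staging write (path, partition column, and source view are assumptions).
# Write parquet matching the destination table's schema to a location Databricks can mount,
# partitioned the same way the destination is partitioned.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-feature-data").getOrCreate()

feature_df = spark.table("some_source_view")  # however the ~7M-row DataFrame is produced

(feature_df
 .repartition("event_date")    # align file layout with the destination's partition column
 .write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("abfss://staging@myaccount.dfs.core.windows.net/audience_segments_staging/"))
```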
f
Thanks @Chris Grass, I will take a look
Hi @Chris Grass, the staging link you provided above, https://www.confessionsofadataguy.com/lessons-learned-from-merge-operations-with-billions-of-records-on-databricks-spark/, is about MERGE INTO … Is this what you really wanted to share for staging data in a mount?
Have a nice weekend!
c
@Frank Shen the merge into is the sql syntax, but you'll notice that it references a mount location with `USING some_staging_table as S`
the actual staging of the data, from your app's perspective, is just pushing to the azure blob storage container
then you configure your dbricks workspace to use that container as a mount
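A hedged sketch of what the Databricks-side statements might look like, reusing the SQL connector from earlier; the staging table name, staging location, and merge key are placeholders rather than details from the thread.
```python
# Hedged sketch of the Databricks-side step (staging table name, location, and merge key are
# placeholders). Expose the staged parquet files as a table, then MERGE them into the destination.
from databricks import sql

server_hostname = "..."   # placeholder connection details, same as earlier
http_path = "..."
access_token = "..."

with sql.connect(server_hostname=server_hostname,
                 http_path=http_path,
                 access_token=access_token) as connection:
    with connection.cursor() as cursor:
        # Point a table at the staged files (a mount path or an external location both work).
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS some_staging_table
            USING PARQUET
            LOCATION 's3://my-staging-bucket/audience_segments_staging/'
        """)
        cursor.execute("""
            MERGE INTO bolt_feature_engg_prod.gold.audience_segments_feature_set AS T
            USING some_staging_table AS S
            ON T.user_id = S.user_id          -- placeholder merge key
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *
        """)
```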
f
Hi @Chris Grass, I see. Thank you for the explanation. However, we don't use Azure; we use AWS. What do you think is the equivalent of an Azure Blob Storage container in AWS? Do you think AWS S3 will work?
c
yup, s3 should be fine
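For the EKS-to-S3 connectivity piece, a minimal sketch of an external Spark session writing the staged files to S3; the credentials setup (pod IAM role via the default provider chain) and bucket name are assumptions, not details from the thread.
```python
# Minimal sketch of the external (EKS) Spark job writing staged data to S3.
# Assumptions: hadoop-aws / AWS SDK jars are on the classpath, and the pod's IAM role (IRSA)
# supplies credentials through the default provider chain; bucket and view names are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stage-to-s3")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
         .getOrCreate())

feature_df = spark.table("some_source_view")
feature_df.write.mode("overwrite").parquet("s3a://my-staging-bucket/audience_segments_staging/")
```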
f
c
we have not. our ingestion paradigm and schema needs to test source data to see if it's a duplicate, hence using `merge`. if your source system guarantees write-once then `copy into` might work
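For reference, a hedged sketch of the copy-into variant mentioned above, using the same placeholder connection details and staging path as earlier; COPY INTO tracks which files it has already loaded, which is why it suits write-once sources.
```python
# Hedged sketch of the COPY INTO alternative (placeholder connection details and paths).
# COPY INTO skips files it has already loaded, so re-running it will not duplicate rows
# as long as the source only writes each file once.
from databricks import sql

server_hostname = "..."   # placeholder connection details, same as earlier
http_path = "..."
access_token = "..."

with sql.connect(server_hostname=server_hostname,
                 http_path=http_path,
                 access_token=access_token) as connection:
    with connection.cursor() as cursor:
        cursor.execute("""
            COPY INTO bolt_feature_engg_prod.gold.audience_segments_feature_set
            FROM 's3://my-staging-bucket/audience_segments_staging/'
            FILEFORMAT = PARQUET
        """)
```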