# ask-the-community
f
Hello, I have an existing process that runs in a Databricks notebook and reads feature data from a Databricks Catalog table (e.g. bolt_feature_engg_prod.gold.audience_segments_feature_set) using SQL:
```python
input_data_query = f'select ... FROM bolt_feature_engg_prod.gold.audience_segments_feature_set where ...'
input_data = spark.sql(input_data_query)
feature_data = input_data.toPandas()
```
I need to convert this Databricks process to a Flyte workflow, and I would like to re-use the SQL query because it is a very complicated query. Also, the upstream process that produces the feature data cannot be changed, and the data must be stored in the same Databricks Catalog table. How can I do that? Thanks!
c
is your question specifically about interacting with databricks within a flyte Task? if so, you can use the python databricks SDK to run arbitrary queries that you have already defined:
```python
"""
Docs found here: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/python-sql-connector
"""
from databricks import sql

stmt = "SELECT ..."  # the existing (complicated) query can be dropped in here as-is


def run_query(stmt, server_hostname, http_path, access_token, **kwargs):
    # Open a connection to the SQL warehouse / cluster endpoint and execute the statement.
    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token,
                     **kwargs) as connection:
        with connection.cursor() as cursor:
            cursor.execute(stmt)
            return cursor.fetchall()
```
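Not from the thread, but for the Flyte side of the original question, a minimal sketch of how the snippet above could be wrapped in a Flyte task that returns a pandas DataFrame. It assumes flytekit and databricks-sql-connector are installed; the connection parameters are placeholders and would normally come from Flyte secrets rather than plain task inputs.
```python
# Hypothetical sketch: wrapping the connector call in a Flyte task.
# Assumes flytekit and databricks-sql-connector are installed; connection details are
# placeholders and would normally be injected via Flyte secrets.
import pandas as pd
from databricks import sql
from flytekit import task


@task
def fetch_feature_data(server_hostname: str, http_path: str, access_token: str) -> pd.DataFrame:
    # The existing complicated query can be reused here unchanged.
    stmt = "select ... FROM bolt_feature_engg_prod.gold.audience_segments_feature_set where ..."
    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token) as connection:
        with connection.cursor() as cursor:
            cursor.execute(stmt)
            # fetchall_arrow() returns a pyarrow.Table; convert it to pandas for downstream tasks.
            return cursor.fetchall_arrow().to_pandas()
```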
f
Hi @Chris Grass, thank you for the quick response. Do you think it will work in a non-Azure environment? Our Databricks cluster is hosted in AWS.
c
@Frank Shen i assume so - the sdk isn't azure specific (only the doc i linked to). there are several other python SDKs that are offered as well, but some require Unity Catalog to be enabled in your workspace
f
Hi @Chris Grass, I tried it and it works with a small dataset. Thanks! However, I have a large dataset of at least 7 million records. Reading is OK, although slow. But saving data of that size back to a Databricks Unity Catalog table takes way too long and won't finish. How would you solve this issue? And do you know how to connect Spark jobs running in EKS to Databricks? I need to run Spark jobs from external EKS, and Spark should save the large dataset to Databricks. I just don't know the connectivity part.
c
we solve this problem by staging the data in a mount that databricks has access to; we write parquet files that match the destination to an azure blob store container, but this can be done several ways (see https://www.confessionsofadataguy.com/lessons-learned-from-merge-operations-with-billions-of-records-on-databricks-spark/). it's also important to understand how partitions are used when inserting data.
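As a rough illustration of the staging write described above (not code from the thread): the path, partition column, and source view below are assumptions; the point is to write parquet that matches the destination schema, partitioned the same way the destination table is partitioned.
```python
# Rough sketch of the staging write (path, partition column, and source view are assumptions).
# Write parquet matching the destination table's schema to a location Databricks can mount,
# partitioned the same way the destination is partitioned.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stage-feature-data").getOrCreate()

feature_df = spark.table("some_source_view")  # however the ~7M-row DataFrame is produced

(feature_df
 .repartition("event_date")    # align file layout with the destination's partition column
 .write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("abfss://staging@myaccount.dfs.core.windows.net/audience_segments_staging/"))
```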
f
Thanks @Chris Grass, I will take a look
Hi @Chris Grass, the staging link you provided above, https://www.confessionsofadataguy.com/lessons-learned-from-merge-operations-with-billions-of-records-on-databricks-spark/, is about MERGE INTO … Is this what you really wanted to share for staging data in a mount?
Have a nice weekend!
c
@Frank Shen the merge into is the sql syntax, but you'll notice that it references a mount location with `USING some_staging_table as S`
the actual staging of the data, from your app's perspective, is just pushing to the azure blob storage container
then you configure your dbricks workspace to use that container as a mount
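A hedged sketch of what the Databricks-side statements might look like, reusing the SQL connector from earlier; the staging table name, staging location, and merge key are placeholders rather than details from the thread.
```python
# Hedged sketch of the Databricks-side step (staging table name, location, and merge key are
# placeholders). Expose the staged parquet files as a table, then MERGE them into the destination.
from databricks import sql

server_hostname = "..."   # placeholder connection details, same as earlier
http_path = "..."
access_token = "..."

with sql.connect(server_hostname=server_hostname,
                 http_path=http_path,
                 access_token=access_token) as connection:
    with connection.cursor() as cursor:
        # Point a table at the staged files (a mount path or an external location both work).
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS some_staging_table
            USING PARQUET
            LOCATION 's3://my-staging-bucket/audience_segments_staging/'
        """)
        cursor.execute("""
            MERGE INTO bolt_feature_engg_prod.gold.audience_segments_feature_set AS T
            USING some_staging_table AS S
            ON T.user_id = S.user_id          -- placeholder merge key
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *
        """)
```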
f
Hi @Chris Grass, I see. Thank you for the explanation. However, we don't use Azure; we use AWS. What do you think is the equivalent of an Azure Blob Storage container in AWS? Do you think AWS S3 will work?
c
yup, s3 should be fine
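For the EKS-to-S3 connectivity piece, a minimal sketch of an external Spark session writing the staged files to S3; the credentials setup (pod IAM role via the default provider chain) and bucket name are assumptions, not details from the thread.
```python
# Minimal sketch of the external (EKS) Spark job writing staged data to S3.
# Assumptions: hadoop-aws / AWS SDK jars are on the classpath, and the pod's IAM role (IRSA)
# supplies credentials through the default provider chain; bucket and view names are made up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("stage-to-s3")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
         .getOrCreate())

feature_df = spark.table("some_source_view")
feature_df.write.mode("overwrite").parquet("s3a://my-staging-bucket/audience_segments_staging/")
```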
f
c
we have not. our ingestion paradigm and schema needs to test source data to see if it's a duplicate, hence using `merge`. if your source system guarantees write-once then `copy into` might work
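For reference, a hedged sketch of the copy-into variant mentioned above, using the same placeholder connection details and staging path as earlier; COPY INTO tracks which files it has already loaded, which is why it suits write-once sources.
```python
# Hedged sketch of the COPY INTO alternative (placeholder connection details and paths).
# COPY INTO skips files it has already loaded, so re-running it will not duplicate rows
# as long as the source only writes each file once.
from databricks import sql

server_hostname = "..."   # placeholder connection details, same as earlier
http_path = "..."
access_token = "..."

with sql.connect(server_hostname=server_hostname,
                 http_path=http_path,
                 access_token=access_token) as connection:
    with connection.cursor() as cursor:
        cursor.execute("""
            COPY INTO bolt_feature_engg_prod.gold.audience_segments_feature_set
            FROM 's3://my-staging-bucket/audience_segments_staging/'
            FILEFORMAT = PARQUET
        """)
```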