flyte-org #ask-the-community

Hi, I have a bunch of `ContainerTask`s and want to specify the image tag to use in my workflow as an input (so I can override the default when launching). Is there a way to do this?

Giacomo Dabisias IT

01/02/2023, 4:53 PM

Hi all! I am currently evaluating whether Flyte could be used as a workflow system to handle multiple on-prem compute clusters. A few questions:
• We have some clusters that are running k8s, but some are running SLURM. I know that Flyte handles k8s, but can it schedule jobs on top of SLURM?
• Can Flyte load balance workflows between multiple compute clusters in different physical locations?
• Can Flyte use a custom data plane that is API-compatible with AWS S3?

Ophir Yoktan

01/02/2023, 7:02 PM

My use case is as follows: the initial task of the Flyte workflow extracts a dataset.
• In production, I want to always use the latest data (or use the cache with a short expiration).
• During development, runtime is more important than data freshness, so I prefer to use a cached dataset.
Is there an option to control the caching when launching a workflow (and not just when defining the workflow)? Cross posting: https://stackoverflow.com/q/74984146/590335
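One pattern that may help with the question above, sketched on the assumption that cache hits are decided by the task's input values (alongside its cache version): add an input whose value is chosen at launch time. The code below is a plain-Python stand-in, not a Flyte API; `extract_dataset`, `freshness_bucket`, `hourly_bucket` and `DEV_BUCKET` are hypothetical names.

```python
from datetime import datetime, timezone
from typing import Dict, Tuple

# Plain-Python stand-in for an input-keyed task cache (hypothetical, not a Flyte API).
_cache: Dict[Tuple[str, str], str] = {}
calls = {"extract": 0}


def extract_dataset(source: str, freshness_bucket: str) -> str:
    """Pretend to extract a dataset; results are cached per (source, bucket)."""
    key = (source, freshness_bucket)
    if key not in _cache:
        calls["extract"] += 1  # count real (non-cached) extractions
        _cache[key] = f"dataset from {source} @ {freshness_bucket}"
    return _cache[key]


def hourly_bucket(now: datetime) -> str:
    """Production: the bucket changes every hour, so cached data ages out quickly."""
    return now.strftime("%Y-%m-%dT%H")


DEV_BUCKET = "dev-pinned"  # development: a constant value keeps hitting the cache

t1 = datetime(2023, 1, 2, 19, 2, tzinfo=timezone.utc)
t2 = datetime(2023, 1, 2, 20, 5, tzinfo=timezone.utc)
extract_dataset("s3://bucket/raw", hourly_bucket(t1))  # first run: extracts
extract_dataset("s3://bucket/raw", hourly_bucket(t2))  # new hour: extracts again
extract_dataset("s3://bucket/raw", DEV_BUCKET)         # first dev run: extracts
extract_dataset("s3://bucket/raw", DEV_BUCKET)         # same bucket: cache hit
print(calls["extract"])
```

The design choice is that freshness becomes an ordinary workflow input, so whoever launches the workflow decides how stale the data may be without changing the workflow definition.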

Panos Strouth

01/03/2023, 10:45 AM

Hi everyone! Happy new year! Has anyone ever used AWS Cognito for authentication with Flyte? (We deployed Flyte on EKS.) Cognito is a requirement for us in order to give access only to authenticated users. Currently we are facing several issues following this guide: https://docs.flyte.org/en/latest/deployment/cluster_config/auth_setup.html (Okta worked seamlessly, but we decided not to use it. Cognito is a must for our setup.) When Flyte tries to access our Cognito domain, we get the following error in one of the admin pods:

{"json":{},"level":"error","msg":"Error creating auth context [AUTH_CONTEXT_SETUP_FAILED] Error creating oidc provider ....

Neither flytectl nor the browser redirects to the Cognito UI for authentication.

Chandramoulee K V

01/03/2023, 1:00 PM

Hi all! Happy New Year! I have a Spark-related question. Scenario: currently, when executing the Spark workflow, the driver and the executors are scheduled in different pods. E.g., we have 1 driver (4 CPU cores, 8 GB memory) and 4 executors (4 CPU cores, 8 GB memory each) -> 1 node for the 1 driver pod and 1 node to accommodate all 4 executor pods. The node that accommodates the executors is very large, because the request sent for the node is the sum of the CPU and memory of all 4 executors combined, so the request is greater than 16 cores and 32 GB memory, which will be inefficient going forward with more executors. Is there a workaround/fix to make this scale horizontally, i.e. spawn the 4 executor pods on separate nodes (or some combination of n nodes, each holding n executor pods), so that we have nodes running in parallel instead of pods running in parallel inside a single very large node?

Seth Baer

01/03/2023, 4:07 PM

Hey there! Happy New Year everyone!

Dan Rammer (hamersaw)

01/03/2023, 5:37 PM

Hey @Seth Baer, this feature certainly has not been deprecated. In fact, as I understand it, there are a few things we have discussed adding recently. Do you know what version of FlyteConsole you're running?

David Cupp

01/03/2023, 6:33 PM

Dynamic Job Registration

So the jobs I am planning to run on Flyte are "dynamic" in that the set of jobs that exist (and their schedules) can change minute to minute. For example, if one of our customers goes into our UI and adds an "export", then we suddenly have a new ExportJob that needs to be scheduled and ready to go in less than 15 minutes. All instances of "ExportJob" share the same code, but each one has a unique identity and set of inputs. We typically find all of the jobs that exist (somewhere under 10K total) by periodically making a bunch of service calls and then updating our scheduling system. I'm trying to figure out the correct way to build the same thing using Flyte, and just want to double check my understanding of the solution. It looks like there are two main options:
1. Dynamically generate the workflow code and then execute the Flyte CLI's "register" command: https://docs.flyte.org/en/latest/concepts/registration.html
2. Manually call `LaunchPlanCreateRequest`?

Are both of these supported workflows, or is #2 considered a bit of a hack?

Rahul Mehta

01/04/2023, 7:17 AM

https://github.com/flyteorg/flyteconsole/issues/638 bumping this issue -- has this been addressed in any recent flyteconsole versions? We're encountering this issue with some of our high fan-out workflows and it's starting to add a bit of user friction

Mücahit

01/04/2023, 9:20 AM

Hi!
• It seems like the FlyteAdmin/Control Plane dashboard is outdated, since I don’t see any data for these, and metric names seem to have changed from `flyte:admin:database:postgres;*` to `flyte:admin:admin:database:*` etc., and it doesn’t seem that a quick search/replace can fix it. Do you have an up-to-date dashboard for FlyteAdmin/control plane?
• I’ve looked at the monitoring docs, and they look pretty neat, but I was wondering if you have any alert rules for checking system health?

Robin Eklund

01/05/2023, 9:52 AM

Hi! I am trying to build a workflow that will be triggered on a daily basis and can also be triggered for a specific day (more like how Airflow works). So I created this test:

@task
def print_the_date(execution_date: str) -> str:
    print(f"in print_the_date: execution_date={execution_date}")
    return execution_date


@workflow
def my_wf(execution_date: str) -> dict[str, str]:
    print(f"inside my_wf: execution_date={execution_date}")
    print_the_date(execution_date=execution_date)
    return {"execution_date": execution_date}

This works perfectly fine running locally like this:

pyflyte run workflow.py my_wf --execution_date 2023-01-01

But I'm not sure how to do this on a scheduled basis. I have tried this:

LaunchPlan.get_or_create(
    name="my_lp",
    workflow=my_wf,
    schedule=CronSchedule(schedule="0 7 * * *"),
    security_context=security_context,
    default_inputs={
        "execution_date": datetime.today().strftime('%Y-%m-%d %H:%M:%S')
    }
)

But I understand this doesn't work because the default inputs are created at "compile time" of the workflow(?). I would also like to be able to run it locally like this:

pyflyte run workflow.py my_wf

But it seems Flyte does not support a default input like this:

@workflow
def my_wf(execution_date: Union[str, None] = None) -> dict[str, str]:
   ...

If anyone has done this before and can point me in the right direction, I would appreciate the help!
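The "compile-time" behavior described above can be illustrated in plain Python: a value computed while building the default inputs is frozen at that moment, whereas a sentinel default resolved inside the function is computed on every run. This is a minimal sketch that does not use flytekit; `resolve_execution_date` is a hypothetical helper, and the empty-string sentinel is just one way to sidestep an optional type.

```python
from datetime import datetime
from typing import Optional

DATE_FORMAT = "%Y-%m-%d"


def resolve_execution_date(execution_date: str = "", now: Optional[datetime] = None) -> str:
    """Return the given date, or today's date computed at call time."""
    if execution_date:
        # Validate explicit input so malformed dates fail early.
        datetime.strptime(execution_date, DATE_FORMAT)
        return execution_date
    current = now if now is not None else datetime.today()
    return current.strftime(DATE_FORMAT)


# Frozen: evaluated once, when this line runs (analogous to registration time).
frozen_default = datetime(2023, 1, 5).strftime(DATE_FORMAT)

# Resolved late: evaluated on every call.
print(resolve_execution_date("2023-01-01"))              # explicit date wins
print(resolve_execution_date(now=datetime(2023, 1, 6)))  # falls back to "now"
```

For scheduled runs, it may also be worth checking whether the `kickoff_time_input_arg` parameter of flytekit's `CronSchedule` fits; it is intended to pass the scheduled kickoff time to a `datetime` workflow input.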

Felix Ruess

01/05/2023, 4:54 PM

When subworkflows/tasks are not run because they exceed the resource quota, shouldn't they automatically start running once enough resources are available again or the limits have been increased? Right now the tasks stay in running, but nothing happens...

Felix Ruess

01/05/2023, 5:13 PM

Ah, seems it's https://github.com/flyteorg/flyte/issues/3065 again...

Frank Shen

01/05/2023, 7:50 PM

Happy New Year! I have a large machine learning feature dataset stored as multiple Parquet files in an AWS S3 folder (key). I have a Flyte task to read the data in and return it as a pandas DF. Due to the large data size, I prefer to use a Flyte Spark task to read the data. Sample code:

@task(
    container_image="xyz.dkr.ecr.us-east-1.amazonaws.com/flyte-pyspark:latest",
    task_config=Spark(
        spark_conf={...
        }
    ),
)
def read_spark_df() -> pandas.DataFrame:
    sess = flytekit.current_context().spark_session
    spark_df = sess.read.parquet("s3a://bucket/key.parquet").toPandas()
    df = pandas.DataFrame(spark_df)
    return df

Frank Shen

01/05/2023, 7:52 PM

Is this supported in Flyte? Could anyone share their successful experience, and what additional steps or configs are needed? Thanks.

Felix Ruess

01/05/2023, 8:39 PM

I'm trying to debug a task failure: it stopped after 1h15min and I can't seem to find why. Is there a timeout set in flyte by default somewhere?

Frank Shen

01/06/2023, 6:46 PM

I have the following Spark task (reading Parquet data from S3) working in Flyte remote (the Flyte remote env has apparently been set up to access S3).

@task(
    container_image="xyz.dkr.ecr.us-east-1.amazonaws.com/flyte-pyspark:latest",
    task_config=Spark(
        spark_conf={...
        }
    ),
)
def read_spark_df() -> pandas.DataFrame:
    sess = flytekit.current_context().spark_session
    spark_df = sess.read.parquet("s3a://bucket/key.parquet").toPandas()
    ....

Frank Shen

01/06/2023, 6:49 PM

However, when I ran it locally, it failed with

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:53)

Frank Shen

01/06/2023, 6:51 PM

Could anyone explain how to set up a local env to read from S3 using Spark?
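The `ClassNotFoundException` above usually means that the `hadoop-aws` module, which provides `org.apache.hadoop.fs.s3a.S3AFileSystem`, is not on the local Spark classpath. Below is a minimal sketch of Spark settings that may help, assuming the `hadoop-aws` version has to match the Hadoop version bundled with the local Spark install. `local_s3a_conf` is a hypothetical helper, and the version passed in is a placeholder to replace.

```python
from typing import Dict


def local_s3a_conf(hadoop_version: str) -> Dict[str, str]:
    """Build Spark settings that pull in the S3A connector for local runs."""
    if not hadoop_version:
        raise ValueError("hadoop_version must match the Hadoop build of your Spark")
    return {
        # Ask Spark to fetch hadoop-aws (and its transitive dependencies) from Maven.
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_version}",
        # Explicitly map the s3a:// scheme to the S3A filesystem class.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


conf = local_s3a_conf("3.3.2")  # placeholder version, check your Spark install
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

These entries could be merged into the `spark_conf` dictionary of the task or into a local `SparkSession` builder. AWS credentials still have to be available to the local process separately.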

Frank Shen

01/06/2023, 10:38 PM

This is a common use case, so why can't I find any answer to it? I am using a Mac.

Frank Shen

01/06/2023, 10:39 PM

Does anyone in this community read data from S3 using a Spark task in Flyte locally?

Frank Shen

01/06/2023, 10:40 PM

Running it locally is necessary before you can deploy it to remote, isn’t it?

Rahul Mehta

01/06/2023, 11:37 PM

Hey @Ketan (kumare3), I recall you mentioned a while back in one of our syncs that the metadata fields to expose users who submitted workflows may be available by the API but isn't exposed in the frontend today/it'd be possible to specify a "user" even without auth (basically so we can identify runs in the UI based on the user who launched it). Do you have any other color here/could someone give me some pointers?

Frank Shen

01/07/2023, 12:43 AM

@Yee, @Kevin Su, @Ketan (kumare3), I got it working in PySpark. However, it failed in Flyte with this error:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.IOException: Failed to connect to nyxmmedina01741.wmad.warnermedia.com/10.217.173.85:55808

Stephen

01/09/2023, 10:45 AM

Hey, is anyone sending logs from Flyte to Datadog using the Datadog agent? We want to do it using annotations, but Datadog log scraping discovery relies on this annotation:

ad.datadoghq.com/<CONTAINER_IDENTIFIER>.logs: '[<LOG_CONFIG>]'

Here, <CONTAINER_IDENTIFIER> needs to be identical to the name of the container from which to scrape the logs. When running Flyte tasks, the container name is generated from the execution ID, which cannot be known in advance. It is possible to configure default annotations for all Flyte workflow pods centrally. It is even possible to inject things like project name and domain dynamically into these default annotations, but it is not possible to dynamically inject the container name(s).
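To make the constraint concrete, here is a small sketch of how the annotation key depends on the container name. `datadog_log_annotation` is a hypothetical helper, and the container name and log config values are made up; only the `ad.datadoghq.com/<CONTAINER_IDENTIFIER>.logs` key format comes from the message above.

```python
import json
from typing import Dict, List


def datadog_log_annotation(container_name: str, log_configs: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the pod annotation for log collection from one named container."""
    if not container_name:
        raise ValueError("container name must be known to build the annotation key")
    # The container name is part of the key, not the value.
    key = f"ad.datadoghq.com/{container_name}.logs"
    return {key: json.dumps(log_configs)}


# Made-up container name; for Flyte task pods it is derived from the execution ID.
annotation = datadog_log_annotation(
    "abc123-n0-0",
    [{"source": "flyte", "service": "my-workflow"}],
)
print(annotation)
```

Because the name is embedded in the annotation key, a static default annotation cannot cover containers whose names are only known once the execution exists.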

Niels Bantilan

01/09/2023, 2:37 PM

📣 hey all, just a quick call to action for you data engineers out there 📣 We’d love it if you can share your experiences using Flyte for DE use cases in this reddit thread: https://www.reddit.com/r/dataengineering/comments/106f68v/are_you_using_an_orchestra[…]xiciz%2F99NochIv57CkhcYg8%2F6H0q%2FazJnR1%2F34IXljwEAAA%3D%3D

Fabio Grätz

01/09/2023, 4:00 PM

For distributed PyTorch (or TF, …) tasks, which worker's return value is passed along to subsequent tasks? Is this random/a race condition? When creating a PyTorch task, the `args` of both pods specify the same values for:

- --output-prefix
- gs://.../metadata/propeller/sandbox-development-f6695ca08aa47490c859/n0/data/0
- --raw-output-data-prefix
- gs://.../xq/f6695ca08aa47490c859-n0-0

- --output-prefix
- gs://...metadata/propeller/sandbox-development-f6695ca08aa47490c859/n0/data/0
- --raw-output-data-prefix
- gs://.../xq/f6695ca08aa47490c859-n0-0

If both return the same value (as is presumably often the case), it shouldn’t matter if both write. But if I want to return a metric that might be slightly different for each worker, is it random which one I get?

Samhita Alla

01/09/2023, 5:01 PM

Hey everyone! If you're a Flyte user, we would be grateful if you could share your thoughts on the https://www.reddit.com/r/dataengineering/comments/106f68v/are_you_using_an_orchestra[…]xiciz%2F99NochIv57CkhcYg8%2F6H0q%2FazJnR1%2F34IXljwEAAA%3D%3D thread about why you chose it. Thank you!

Felix Ruess

01/09/2023, 5:52 PM

Hi, does anyone know why tasks that were executed as part of a workflow don't show up in the console task execution list? Same goes for workflows that were executed as subworkflows... Is this a bug or a missing feature or am I missing something else?