How does development with Flyte compare to Dagster Flyte #ask-ai

# ask-ai

little-lock-48902

07/31/2024, 5:28 AM

How does development with Flyte compare to Dagster?

ancient-wolf-19325

07/31/2024, 5:28 AM

Flyte and Dagster differ in both development process and feature set. Flyte is designed for scalability and reproducibility: it runs on Kubernetes for distributed processing and supports multi-language workflows. Dagster is more Python-centric and recently introduced higher-level APIs such as software-defined assets for ETL tasks (source). Flyte's development lifecycle includes features like caching, a remote API, and agents, which aid workflow development (source). Flyte also versions workflows and tasks, which enables reproducibility and efficient debugging (source). Would you like more details on specific aspects of the development process?

Sources:
- https://discuss.flyte.org/t/8135475/there-are-several-good-comparison-articles-between-flyte-and/
- https://docs.flyte.org/en/latest/user_guide/development_lifecycle/index.html
- https://github.com/flyteorg/flyte/issues/660

little-lock-48902

07/31/2024, 5:31 AM

@ancient-wolf-19325 is development with Flyte and running it at small scale possible without deploying a kubernetes instance?

ancient-wolf-19325

07/31/2024, 5:31 AM

Yes, it is possible to develop and run Flyte at a small scale without deploying a full Kubernetes cluster. Flyte provides a local development environment called the "Flyte Sandbox," which runs a complete Flyte deployment inside a single Docker container (source). The container bundles a lightweight Kubernetes distribution, so there is no separate cluster to manage. Additionally, workflows can run in a plain local Python environment using `pyflyte run` (without the `--remote` flag), which is useful for quick iterations and testing (source). Would you like more details on setting up the Flyte Sandbox or running workflows locally?

Sources:
- https://github.com/flyteorg/flyte/issues/2903
- https://github.com/flyteorg/flyte/blob/master/docs/getting_started_with_workflow_development/running_a_workflow_locally.md

little-lock-48902

07/31/2024, 5:32 AM

@ancient-wolf-19325 is there an equivalent to Dagster's asset concept in Flyte?

ancient-wolf-19325

07/31/2024, 5:32 AM

I'm not able to find any data sources to answer your question. You might want to specify what aspects of Dagster's asset concept you are interested in or how you plan to use it in Flyte. If this doesn't help, please reach out to the Flyte support team for further assistance.

little-lock-48902

07/31/2024, 5:33 AM

@ancient-wolf-19325 is there something similar like Dagster's assets in Flyte?

ancient-wolf-19325

07/31/2024, 5:33 AM

Flyte does not have a direct equivalent to Dagster's "assets" concept. However, similar functionality can be built from its strongly-typed tasks and workflows, which can be used to manage and track data transformations and dependencies. In Flyte, tasks are the basic unit of execution and are composed into workflows. Every task and workflow has strongly-typed inputs and outputs, which gives clear data lineage and reproducibility (source). Flyte also supports caching and versioning, which can help manage data artifacts and their transformations (source). Would you like more details on how to implement this in Flyte?

Sources:
- https://flyte.org/blog/orchestrating-data-pipelines-at-lyft-comparing-flyte-and-airflow/
- https://flyte.org/blog/using-flyte-to-power-mlops-at-striveworks/

little-lock-48902

07/31/2024, 5:36 AM

@ancient-wolf-19325 how or where is output stored with Flyte, and how is input gathered?

ancient-wolf-19325

07/31/2024, 5:36 AM

In Flyte, outputs are stored in object stores like S3 or GCS, and inputs are read from these stores or directly from the metadata. Flyte handles data in two parts: metadata and raw data.

1. Metadata: This includes data about task inputs and other artifacts. Metadata is stored in Flyte's control plane components and is accessible via FlyteConsole or the CLI. Primitive data types and references to large objects are stored as metadata (source).
2. Raw data: This is the actual data, such as Pandas DataFrames or files. Raw data paths are unique for every execution and are stored in object stores. Flyte only deals with references to these large objects, not the objects themselves (source).

For example, when a task produces an output, Flyte offloads the data to an object store and stores a reference to it in the metadata. When another task needs this data as input, Flyte retrieves the reference from the metadata and downloads the data from the object store. Here is a simple example of two tasks that pass a DataFrame between them:

from flytekit import task, workflow
import pandas as pd

@task
def generate_data() -> pd.DataFrame:
    data = {'col1': [1, 2], 'col2': [3, 4]}
    return pd.DataFrame(data)

@task
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    df['col3'] = df['col1'] + df['col2']
    return df

@workflow
def data_workflow() -> pd.DataFrame:
    df = generate_data()
    result = process_data(df=df)
    return result

Would you like more details on configuring storage or handling specific data types?

Sources:
- https://docs.flyte.org/en/latest/concepts/data_management.html#metadata
- https://docs.flyte.org/en/latest/concepts/data_management.html
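To make the metadata versus raw data split more concrete, here is a purely conceptual sketch in plain Python. It is not how Flyte is implemented: a temporary directory stands in for the object store, JSON stands in for Flyte's type system, and the helper names `to_literal` and `from_literal` are invented for this illustration.

```python
import json
import tempfile
import uuid
from pathlib import Path

# Values small enough to be carried inline in the metadata record.
INLINE_TYPES = (int, float, str, bool)


def to_literal(value, store: Path) -> dict:
    """Return a small metadata record describing `value`."""
    if isinstance(value, INLINE_TYPES):
        # Primitives travel inside the metadata itself.
        return {"kind": "primitive", "value": value}
    # Larger objects are written to the store under a unique path,
    # and only a reference to them is kept in the metadata.
    path = store / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(value), encoding="utf-8")
    return {"kind": "reference", "uri": str(path)}


def from_literal(literal: dict):
    """Resolve a metadata record back into a Python value."""
    if literal["kind"] == "primitive":
        return literal["value"]
    return json.loads(Path(literal["uri"]).read_text(encoding="utf-8"))


if __name__ == "__main__":
    store = Path(tempfile.mkdtemp())
    table = {"col1": [1, 2], "col2": [3, 4]}
    record = to_literal(table, store)  # what an upstream task would emit
    print(record["kind"])
    print(from_literal(record) == table)  # what a downstream task would read
```

The point of the design is that the record passed between steps stays small regardless of how large the underlying data is.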

little-lock-48902

07/31/2024, 5:36 AM

@ancient-wolf-19325 in which programming languages can I develop for Flyte?

ancient-wolf-19325

07/31/2024, 5:36 AM

You can develop for Flyte primarily using Python, Java, and Scala. Flyte provides SDKs for these languages to help you create tasks and workflows.

1. Python: The primary SDK for Flyte is `flytekit`, which is extensively documented and widely used for developing tasks and workflows (source).
2. Java/Scala: Flyte also supports Java and Scala through the `flytekit-java` SDK, which allows you to define and execute tasks and workflows in these languages (source).

Would you like more details on setting up the SDKs or examples in a specific language?

Sources:
- https://flytekit.readthedocs.io
- https://github.com/spotify/flytekit-java

little-lock-48902

07/31/2024, 6:07 AM

@ancient-wolf-19325 how is Flyte integrated with event queue systems / message brokers?

ancient-wolf-19325

07/31/2024, 6:07 AM

Flyte integrates with event queue systems and message brokers like AWS SNS/SQS, GCP Pub/Sub, and Apache Kafka through its cloud events configuration. This allows Flyte to emit events related to workflow and task execution, which can be consumed by external systems for further processing or notifications.

1. AWS SNS/SQS: Flyte can be configured to send events to AWS SNS, which can then push these events to SQS. This setup is useful for triggering downstream processes or notifications (source).
2. GCP Pub/Sub: Similar to AWS, Flyte can send events to GCP Pub/Sub, enabling integration with GCP-based workflows and services (source).
3. Apache Kafka: Flyte can also publish events to Apache Kafka, which is useful for high-throughput, low-latency event processing (source).

Here is an example configuration for enabling cloud events in Flyte:

cloudEvents:
  enable: true
  aws:
    region: us-east-2
  eventsPublisher:
    eventTypes:
      - all  # or node, task, workflow
    topicName: arn:aws:sns:us-east-2:123456:123-my-topic
  type: aws

Would you like more details on configuring specific integrations or examples of use cases?

Sources:
- https://docs.flyte.org/en/latest/deployment/cluster_config/cloud_event.html

little-lock-48902

07/31/2024, 6:09 AM

@ancient-wolf-19325 what about receiving event messages?

ancient-wolf-19325

07/31/2024, 6:09 AM

Flyte primarily focuses on emitting events to external systems rather than directly receiving event messages. However, you can design workflows to be triggered by external events using various mechanisms:

1. Scheduled Workflows: Flyte supports scheduling workflows using cron-like schedules. Schedules are declared on launch plans, and the scheduler itself is configured through FlyteAdmin (source).
2. External Triggers: You can use external systems to trigger Flyte workflows by invoking FlyteAdmin's API. For example, an AWS Lambda function or a GCP Cloud Function can listen to an event queue and trigger a Flyte workflow via an API call (source).
3. Custom Event Handling: You can create custom tasks within Flyte that poll an event queue or message broker for new messages and then process them. This approach allows you to integrate Flyte with any event-driven architecture.

Here is an example of a custom task that polls an SQS queue:

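import boto3
from flytekit import task, workflow

@task
def poll_sqs(queue_url: str) -> str:
    # Credentials and region come from the standard boto3 configuration chain.
    sqs = boto3.client('sqs')
    # Fetch at most one message; 'Messages' is absent when the queue is empty.
    response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = response.get('Messages', [])
    if messages:
        message = messages[0]
        # Delete the message so it is not redelivered after the visibility timeout.
        receipt_handle = message['ReceiptHandle']
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
        return message['Body']
    # Sentinel value returned when the queue had nothing to deliver.
    return "No messages"

@workflow
def event_driven_workflow(queue_url: str) -> str:
    return poll_sqs(queue_url=queue_url)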

Would you like more details on any specific integration or example?

Sources:
- https://github.com/flyteorg/flytesnacks/blob/master/examples/productionizing/productionizing/lp_schedules.py
- https://docs.flyte.org/en/latest/deployment/cluster_config/eventing.html

little-lock-48902

07/31/2024, 6:29 AM

@ancient-wolf-19325 what is the history and story of Flyte? How was it invented, etc

ancient-wolf-19325

07/31/2024, 6:29 AM

Flyte was developed at Lyft to address the challenges of orchestrating machine learning and data processing workflows. The project began in late 2016 to support Lyft's Estimated Time of Arrival (ETA) models, which required processing large amounts of historical data and frequent retraining of models (source). Initially, Lyft used AWS Step Functions for scheduling, but as the platform grew, they encountered scalability and usability issues. This led to the development of a container-native scheduling engine, resulting in the creation of Flyte (source). Flyte was open-sourced in early 2020 to foster a community-driven development model. It has since been adopted by various organizations, including Spotify and Freenome, and has become a top-level project under the Linux Foundation AI & Data chapter (source). Would you like more details on specific milestones or contributions to Flyte?

Sources:
- https://flyte.org/blog/flyte-joins-lf-ai-and-data/

little-lock-48902

07/31/2024, 6:39 AM

@ancient-wolf-19325 which architectural aspects of Flyte, aside from the clustered approach, have been crafted intentionally and make it a good choice for structuring the data processing part of a datascience project?

ancient-wolf-19325

07/31/2024, 6:39 AM

Flyte's architecture includes several intentional design aspects that make it a strong choice for structuring the data processing part of a data science project:

1. Strong Typing and Versioning: Flyte enforces strong typing for all inputs and outputs, ensuring data consistency and enabling rich data lineage. Every task and workflow is versioned, allowing for reproducibility and easy rollback to previous versions (source).
2. Separation of Concerns: Flyte separates business logic from infrastructure management. This allows data scientists to focus on developing workflows without worrying about the underlying infrastructure. The central infrastructure team can manage resources, ensuring efficient utilization (source).
3. Extensibility and Integrations: Flyte is highly extensible, supporting plugins and integrations with various tools and services like AWS Sagemaker, GCP BigQuery, and Apache Spark. This flexibility allows Flyte to fit into diverse data processing ecosystems (source).
4. Caching and Data Lineage: Flyte supports caching of task outputs, which can significantly speed up repeated executions and save computational resources. The data catalog feature provides detailed data lineage, helping track the origin and transformation of data throughout the workflow (source).
5. Dynamic and Static Graphs: Flyte supports both static and dynamic workflows. Static graphs are reproducible and shareable, while dynamic graphs can adapt to changing inputs and conditions, providing flexibility without sacrificing reproducibility (source).

Would you like more details on any specific architectural aspect or feature?

Sources:
- https://flyte.org/blog/introducing-flyte-a-cloud-native-machine-learning-and-data-processing-platform/
- https://flyte.org/blog/from-incubation-to-graduation-and-beyond/
- https://flyte.org/blog/using-flyte-to-power-mlops-at-striveworks/

little-lock-48902

07/31/2024, 6:45 AM

@ancient-wolf-19325 what are some videos (e.g. youtube), that go into using Flyte with Jupyter notebooks?

ancient-wolf-19325

07/31/2024, 6:45 AM

I'm unable to find any data sources to answer your question. You might want to specify what aspects of using Flyte with Jupyter notebooks you are interested in or how you plan to use it. If this doesn't help, please reach out to the Flyte support team for further assistance.

Sources:
- https://github.com/flyteorg/flytesnacks/blob/master/examples/exploratory_data_analysis/README.md
- https://discuss.flyte.org/t/2714323/hi-i-am-trying-to-integrate-jupyter-notebook-with-flyte-when/

average-finland-92144

07/31/2024, 10:33 AM

@little-lock-48902 all great questions, I hope you find some of it useful. Support for using Flyte from a Jupyter Notebook is in progress right now (ref)

👍 1

bumpy-match-83743

08/01/2024, 2:32 PM

@ancient-wolf-19325 can you compare Flyte to raw kubernetes jobs?

ancient-wolf-19325

08/01/2024, 2:32 PM

Flyte and raw Kubernetes jobs both enable the execution of tasks on a Kubernetes cluster, but they differ significantly in terms of features, ease of use, and capabilities.

1. Abstraction and Ease of Use:
   - Flyte: Provides a high-level abstraction for defining and managing workflows. It allows users to define tasks and workflows using Python, Java, or Scala SDKs, and handles the orchestration, scheduling, and execution of these tasks (source).
   - Kubernetes Jobs: Require users to manually define and manage each job using Kubernetes YAML manifests. This involves specifying the container images, commands, and resource requirements for each job, which can be cumbersome for complex workflows (source).
2. Workflow Management:
   - Flyte: Supports complex workflows with dependencies between tasks, dynamic workflows, and conditional execution. It also provides features like task caching, retries, and failure handling (source).
   - Kubernetes Jobs: Do not natively support complex workflows or dependencies between jobs. Users need to manually manage the execution order and handle retries and failures.
3. Data Management and Lineage:
   - Flyte: Offers built-in support for data management, including strong typing, data lineage, and versioning. It tracks the inputs and outputs of each task, ensuring reproducibility and easy debugging (source).
   - Kubernetes Jobs: Do not provide built-in data management or lineage tracking. Users need to implement their own mechanisms for tracking data and ensuring reproducibility.
4. Scalability and Resource Management:
   - Flyte: Leverages Kubernetes for scalability and resource management but adds features like resource quotas, multi-cluster support, and efficient resource utilization through task isolation (source).
   - Kubernetes Jobs: Provide basic scalability and resource management features but lack the advanced capabilities and optimizations offered by Flyte.
5. Monitoring and Visualization:
   - Flyte: Includes a web-based console (FlyteConsole) for monitoring and visualizing workflows, task executions, and data lineage. It also provides APIs for programmatic access to execution data (source).
   - Kubernetes Jobs: Require users to rely on Kubernetes-native tools like `kubectl` and third-party monitoring solutions for tracking job status and logs.

Would you like more details on any specific aspect of Flyte or Kubernetes jobs? Sources:

ancient-wolf-19325

08/01/2024, 2:33 PM

- https://docs.flyte.org/en/latest/introduction.html
- https://kubernetes.io/docs/concepts/workloads/controllers/job/
- https://docs.flyte.org/en/latest/concepts/workflows.html
- https://flyte.org/blog/upleveling-flyte-data-lineage-using-dolt/
- https://flyte.org/blog/orchestrating-data-pipelines-at-lyft-comparing-flyte-and-airflow/
- https://docs.flyte.org/en/latest/concepts/flyte_console.html
