ray-integration
  • Kamakshi Muthukrishnan (11/08/2022, 5:17 AM)
    @Samhita Alla Looking for peers who have done the Ray integration.. hoping for a closed group to discuss.
  • Samhita Alla (11/08/2022, 5:18 AM)
    Needn’t be a closed group, but yea, let me add my team to this channel.
  • Samhita Alla (11/08/2022, 5:20 AM)
    Hello! @Kamakshi Muthukrishnan has some queries regarding the Ray integration and requested a dedicated channel with people who have tried out the integration.
  • Ketan (kumare3) (11/08/2022, 5:27 AM)
    @Kamakshi Muthukrishnan please ask your questions
  • Ketan (kumare3) (11/08/2022, 5:27 AM)
    it's very hard to add people; as you know, it's a community
  • Ketan (kumare3) (11/08/2022, 5:27 AM)
    but fire away
  • Ketan (kumare3) (11/08/2022, 5:27 AM)
    cc @Kevin Su
  • Kamakshi Muthukrishnan (11/08/2022, 5:36 AM)
    @Kevin Su @Ketan (kumare3) I was trying to update the workers' resources for the Ray instance, but I got the below error:
    details = "Requested CPU default [10] is greater than current limit set in the platform configuration [2]. Please contact Flyte Admins to change these limits or consult the configuration"
    debug_error_string = "UNKNOWN:Error received from peer ipv4:3.229.51.32:443 {created_time:"2022-11-08T05:35:09.299255707+00:00", grpc_status:3, grpc_message:"Requested CPU default [10] is greater than current limit set in the platform configuration [2]. Please contact Flyte Admins to change these limits or consult the configuration"}"
  • Kamakshi Muthukrishnan (11/08/2022, 5:59 AM)
    And when I changed it to limit the CPU to 2, the task has been in Queued status for a long time.. it's almost 20 min now.. Should I change anything?
  • Kamakshi Muthukrishnan (11/08/2022, 6:00 AM)
    @task(task_config=RayJobConfig(...), requests=Resources(cpu="2"), limits=Resources(gpu="2"))
  • Kevin Su (11/08/2022, 5:59 PM)
    The default maximum CPU is 2; you can increase it in the flyteadmin configmap:
    task_resource_defaults.yaml: |
        task_resources:
          defaults:
            cpu: 400m
            memory: 500Mi
            storage: 500Mi
          limits:
            cpu: 2
            gpu: 1
            memory: 4Gi
            storage: 20Mi
  • Kevin Su (11/08/2022, 6:03 PM)
    And here is the default quota per project-domain. If you create too many Ray workers and the total CPU (number of workers * CPU per worker) exceeds the default quota, the task will stay in Queued status forever; see the sketch below the config.
    cluster_resources:
      customData:
      - production:
        - projectQuotaCpu:
            value: "5"
        - projectQuotaMemory:
            value: 4000Mi
      - staging:
        - projectQuotaCpu:
            value: "2"
        - projectQuotaMemory:
            value: 3000Mi
      - development:
        - projectQuotaCpu:
            value: "12"
        - projectQuotaMemory:
            value: 8000Mi
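To make the quota arithmetic concrete, here is a minimal sketch (the task body and numbers are illustrative, not from the thread): one head pod plus two worker pods, each requesting roughly 2 CPUs, needs about 6 CPUs in total, which fits the development quota of 12 above but would queue forever under the staging quota of 2.

    from flytekit import Resources, task
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

    ray_config = RayJobConfig(
        head_node_config=HeadNodeConfig(),
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    )

    # 1 head pod + 2 worker pods, each requesting 2 CPUs: ~6 CPUs total.
    # Fits projectQuotaCpu=12 (development) above, exceeds projectQuotaCpu=2 (staging).
    @task(task_config=ray_config, requests=Resources(cpu="2", mem="1Gi"))
    def quota_example() -> None:
        ...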
  • karthikraj (11/10/2022, 2:32 AM)
    @Kevin Su Do we need to deploy the KubeRay operator and RayCluster in a separate namespace, or do they need to be deployed in the same namespace where the Flyte components are running? https://blog.flyte.org/ray-and-flyte links to a GitHub page whose README mentions the namespace ray-system. Does it matter?
  • Padma Priya M (11/10/2022, 4:21 AM)
    Hi, can we run Ray Tune experiments in Flyte? I am getting an error while executing this:
    import typing
    
    import ray
    from ray import tune
    from flytekit import Resources, task, workflow
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
    
    
    @ray.remote
    def objective(config):
        return (config["x"] * config["x"])
    
    
    ray_config = RayJobConfig(
        head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
        runtime_env={"pip": ["numpy", "pandas"]},
    )
    
    
    @task(task_config=ray_config, limits=Resources(mem="2000Mi", cpu="1"))
    def ray_task(n: int) -> int:
        model_params = {
            "x": tune.randint(-10, 10)
        }
    
        tuner = tune.Tuner(
            objective,
            tune_config=tune.TuneConfig(
                num_samples=10,
                max_concurrent_trials=n,
            ),
    
            param_space=model_params,
        )
        results = tuner.fit()
        return results
    
    
    @workflow
    def ray_workflow(n: int) -> int:
        return ray_task(n=n)
    Are there any other ways to run hyperparameter tuning in a distributed manner, like Ray Tune? (A working sketch follows below.)
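For reference, a version of the above that runs is sketched below, assuming Ray 2.x and flytekitplugins-ray (names and values are illustrative): the trainable passed to tune.Tuner should be a plain function rather than a @ray.remote task, and the Flyte task should return a serializable primitive rather than the ResultGrid.

    from ray import tune
    from flytekit import Resources, task, workflow
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig

    ray_config = RayJobConfig(
        head_node_config=HeadNodeConfig(),
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    )

    def objective(config: dict) -> dict:
        # A plain function trainable; Tune records the returned metrics.
        return {"score": config["x"] ** 2}

    @task(task_config=ray_config, limits=Resources(mem="2000Mi", cpu="1"))
    def tune_task(n: int) -> int:
        tuner = tune.Tuner(
            objective,
            tune_config=tune.TuneConfig(
                metric="score", mode="min", num_samples=10, max_concurrent_trials=n
            ),
            param_space={"x": tune.randint(-10, 10)},
        )
        results = tuner.fit()
        # Return a primitive Flyte can serialize, not the ResultGrid itself.
        return int(results.get_best_result().config["x"])

    @workflow
    def tune_wf(n: int) -> int:
        return tune_task(n=n)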
  • Kamakshi Muthukrishnan (11/15/2022, 1:28 PM)
    Hi.. I earlier tried using ray.init() for my previous Flyte task. Now how should I override the Ray engine to use the default? Even after shutting down the Ray instance, I can see Ray gets initialized automatically??
  • karthikraj (11/17/2022, 5:30 AM)
    Hi Team, we are trying to run a Flyte workflow for ML training using an XGBoost classifier, but we are getting the below error. Could you please help?
    Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'CPU': 1.0, 'object_store_memory': 217143276.0, 'memory': 434412750.0, 'node:10.69.53.118': 0.98}, resources requested by the placement group: [{'CPU': 1.0}, {'CPU': 1.0}]
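Reading the error message itself: the placement group requests two bundles of 1 CPU each, but only 1.0 CPU is free in the cluster, so the second bundle can never be placed. The fix is either to give the Ray cluster more CPU (larger resource requests or more worker replicas) or to configure the training to use fewer actors.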
  • Dylan Wilder (11/18/2022, 9:36 PM)
    Hey, a while back there was an RFC on the Ray integration that included something about support for persisting cluster resources across tasks. Is that still in progress? Can someone point me to the docs?
  • Padma Priya M (11/21/2022, 5:18 AM)
    Hi, how do I view the initiated Ray cluster in the Ray dashboard? While running, I can see that the Ray cluster is initiated locally on port 8265, but when the port is opened it shows the site can't be reached, even while the workflow is running.
  • Padma Priya M (11/29/2022, 2:57 PM)
    Hi, I was trying distributed training using Ray in Flyte. I am getting an error while running this:
    from flytekit import Resources, task, workflow
    from flytekitplugins.ray import HeadNodeConfig, RayJobConfig, WorkerNodeConfig
    import ray
    from ray import tune
    from sklearn.datasets import load_breast_cancer
    from xgboost_ray import RayDMatrix, RayParams, train
    
    #ray.init()
    #ray.init("auto", ignore_reinit_error=True)
    
    ray_config = RayJobConfig(
        head_node_config=HeadNodeConfig(ray_start_params={"log-color": "True"}),
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
    )
    
    num_actors = 4
    num_cpus_per_actor = 1
    
    ray_params = RayParams(
        num_actors=num_actors, cpus_per_actor=num_cpus_per_actor)
    
    
    def train_model(config):
        train_x, train_y = load_breast_cancer(return_X_y=True)
        train_set = RayDMatrix(train_x, train_y)
    
        evals_result = {}
        bst = train(
            params=config,
            dtrain=train_set,
            evals_result=evals_result,
            evals=[(train_set, "train")],
            verbose_eval=False,
            ray_params=ray_params)
        bst.save_model("model.xgb")
    
    
    
    @task(task_config=ray_config, limits=Resources(mem="2000Mi", cpu="1"))
    def train_model_task() -> dict:
        config = {
            "tree_method": "approx",
            "objective": "binary:logistic",
            "eval_metric": ["logloss", "error"],
            "eta": tune.loguniform(1e-4, 1e-1),
            "subsample": tune.uniform(0.5, 1.0),
            "max_depth": tune.randint(1, 9)
        }
    
    
        analysis = tune.run(
            train_model,
            config=config,
            metric="train-error",
            mode="min",
            num_samples=4,
            resources_per_trial=ray_params.get_tune_resources())
        return analysis.best_config
    
    @workflow
    def train_model_wf() -> dict:
        return train_model_task()
  • Padma Priya M (12/08/2022, 3:11 PM)
    Hi, while initiating the Ray cluster, the task is running in only one instance and pod. Generally, if a Ray cluster is initiated, it is expected to run on different instances in a distributed manner, right? Can we do horizontal scaling here to increase the pool of resources?
  • Hiromu Hota (12/20/2022, 9:21 PM)
    Hi! Is there a way to shorten ttlSecondsAfterFinished? By default, it is 3600s (1 hour) and we'd like to tear down a cluster right after a job is complete. Thanks for your help!
    $ k describe rayjobs feb5da8c2a2394fb4ac8-n0-0 -n flytesnacks-development
    ...
      Ttl Seconds After Finished:   3600
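Later flytekitplugins-ray releases expose this directly on RayJobConfig; a minimal sketch, assuming a plugin version that includes these fields:

    from flytekitplugins.ray import RayJobConfig, WorkerNodeConfig

    ray_config = RayJobConfig(
        worker_node_config=[WorkerNodeConfig(group_name="ray-group", replicas=2)],
        shutdown_after_job_finishes=True,  # tear the cluster down when the job completes
        ttl_seconds_after_finished=60,     # instead of the 3600s default
    )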
  • Kevin Su (12/21/2022, 8:50 AM)
    FYI: @Padma Priya M and I found some issues when running the task on KubeRay 0.4.0. If you get any errors as well, please downgrade to 0.3.0 first. I'll take a look into it at the end of this month.
  • Padma Priya M (12/22/2022, 1:50 PM)
    Another issue I faced: when I tried to run a workflow registered via fast registration (pyflyte register --image <image name> <file name> --version <version number>), I got an error. I used an image built with the required dependencies, and the KubeRay version was 0.3.0.
  • Padma Priya M (12/22/2022, 1:52 PM)
    In version 0.3.0, while running the basic Ray task from the documentation (https://docs.flyte.org/projects/cookbook/en/latest/auto/integrations/kubernetes/ray_example/ray_example.html), the pods came up and ran, but the execution stayed queued in the console until we terminated it manually.
  • Padma Priya M (01/13/2023, 4:33 AM)
    Hi, in KubeRay version 0.3.0, while trying to perform Ray training remotely using
    pyflyte --config ~/.flyte/config-remote.yaml run --remote --image <image_name> ray_demo.py wf
    I am getting this issue in the logs and the task is getting queued in the console. When the same workflow is executed locally using
    pyflyte --config ~/.flyte/config-remote.yaml run --image <image_name> ray_demo.py wf
    it works fine.
  • Padma Priya M (01/16/2023, 5:48 AM)
    Hi, I have a doubt regarding the scaling of nodes. Is there an option to make each worker pod run on a different node, so that the cluster spawns 'n' smaller instances instead of one large one? For example, if I request 8G memory and 4 CPUs with 4 replicas, the cluster spawns a single higher-memory instance and tries to accommodate all the worker pods on that one node. Instead, I need an approach where the 4 worker pods are scheduled on 4 different, smaller nodes. Is there any way to achieve this scaling? (One approach is sketched below.)
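One common Kubernetes-level approach, sketched below, is pod anti-affinity: pods carrying the same label are forced onto different nodes. This assumes your setup lets you attach scheduling rules to the worker pods (for example via a pod template); the app=ray-worker label is illustrative.

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: ray-worker
          topologyKey: kubernetes.io/hostname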
  • Ruksana Kabealo (01/30/2023, 8:42 PM)
    Hello! I am running Ray on Flyte and getting a warning about Ray using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. To fix this, I would normally specify --shm-size=3.55gb when running the container, but Flyte runs the containers for us, so I cannot figure out how to specify any run options. Is there a way to specify run options for the containers that Flyte runs? (One workaround is sketched below.) Full text of the warning attached.
    ray_on_flyte_error.txt
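Since Flyte launches the pods, the usual workaround is a default pod template rather than docker run flags: mount a memory-backed emptyDir at /dev/shm. A sketch, assuming your deployment sets default-pod-template-name in the k8s plugin configuration; the names and size are illustrative.

    apiVersion: v1
    kind: PodTemplate
    metadata:
      name: flyte-default-template
      namespace: flytesnacks-development
    template:
      spec:
        containers:
        - name: default
          image: docker.io/rwgrim/docker-noop  # placeholder; K8s requires an image here
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 3.5Gi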
  • Marcin Zieminski (02/23/2023, 8:55 PM)
    Hi, I'm trying to get some of my data processing jobs working on Ray with Flyte. I have KubeRay 0.4.0 installed. When my Flyte task starts, the Ray cluster gets created; it is accessible, I can open the dashboard, and I can even submit jobs to it from my local machine with port forwarding. Unfortunately, the original job is never run, so the cluster just waits with no tasks. I looked at the logs of kuberay-operator, and all I see is these lines repeating forever:
    2023-02-23T18:08:48.386Z    INFO    controllers.RayJob    RayJob associated rayCluster found    {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:48.387Z    INFO    controllers.RayJob    waiting for the cluster to be ready    {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:51.387Z    INFO    controllers.RayJob    reconciling RayJob    {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
    2023-02-23T18:08:51.388Z    INFO    controllers.RayJob    RayJob associated rayCluster found    {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:51.388Z    INFO    controllers.RayJob    waiting for the cluster to be ready    {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:54.388Z    INFO    controllers.RayJob    reconciling RayJob    {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
    2023-02-23T18:08:54.388Z    INFO    controllers.RayJob    RayJob associated rayCluster found    {"rayjob": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0", "raycluster": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:54.389Z    INFO    controllers.RayJob    waiting for the cluster to be ready    {"rayCluster": "a8xqtvnds2sp7bgkn96k-fzvfpg5y-0-raycluster-xlg4f"}
    2023-02-23T18:08:57.389Z    INFO    controllers.RayJob    reconciling RayJob    {"NamespacedName": "ntropy-development/a8xqtvnds2sp7bgkn96k-fzvfpg5y-0"}
    These logs seem to be generated by this piece of code: https://github.com/ray-project/kuberay/blob/89f5fba8d6f868f9fedde1fbe22a6eccad88ecc1/ray-operator/controllers/ray/rayjob_controller.go#L174 and are unexpected, as the cluster is healthy and I can use it on the side. I would appreciate any help and advice. Could it be the operator version? My Flyte deployment is version 1.2.1, Ray in the cluster is 2.2.0, and flytekitplugins-ray is 1.2.7.
  • Kevin Su (02/23/2023, 9:46 PM)
    KubeRay 0.4.0 has some problems; install 0.3.0 or build from the master branch. The KubeRay community is going to release 0.4.1 soon. (A sketch of the Helm downgrade is below.)
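If the operator was installed with Helm, the downgrade might look like this (a sketch, assuming the community kuberay-operator chart and the ray-system namespace mentioned earlier in the thread):

    $ helm repo add kuberay https://ray-project.github.io/kuberay-helm/
    $ helm upgrade --install kuberay-operator kuberay/kuberay-operator --version 0.3.0 -n ray-system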
  • Abdullah Mobeen (03/17/2023, 8:04 PM)
    Hi, we recently opened a pull request to address the following issue (inter-cluster communication between Flyte and a custom Ray cluster). Can someone please review it? It adds to a product Spotify is building that is integral to our machine learning platform.
    cc @Keshi Dai
  • Keshi Dai (03/17/2023, 8:08 PM)
    Hey @Kevin Su @Ketan (kumare3), we are trying to make Flyte work with our internal Flyte cluster setup. @Abdullah Mobeen opened this PR to enable the inter-cluster communication feature for the Ray plugin. Could you help take a look? Thank you so much!
  • Kevin Su (03/17/2023, 8:08 PM)
    Thanks, reviewing
  • Ketan (kumare3) (03/17/2023, 8:11 PM)
    👍
    cc @Dylan Wilder
  • Abdullah Mobeen (03/17/2023, 8:11 PM)
    Technically, the changes are similar to what Spotify did for the Flyte-Flink plugin. Our data infra team also added context to the issue I linked. Thanks!
  • Ketan (kumare3) (03/17/2023, 8:11 PM)
    cc @Dan Rammer (hamersaw) to review as well
  • Dylan Wilder (03/17/2023, 8:12 PM)
    timely 😄
  • Dan Rammer (hamersaw) (03/20/2023, 4:22 PM)
    Looks great, merged, thanks @Abdullah Mobeen! Now we just need to update the flyteplugins dependency version in flytepropeller. Are you looking for an immediate propeller release on this, or are you building your own image anyway?
  • Abdullah Mobeen (03/21/2023, 1:33 PM)
    Thanks a lot @Dan Rammer (hamersaw)! Yes -- since we prefer to always stay on a stable Flyte release, it would be better to make a new release that covers this plugin.
  • Ketan (kumare3) (03/21/2023, 2:37 PM)
    Stable Flyte release will be 1.4
  • Keshi Dai (03/21/2023, 7:48 PM)
    @Ketan (kumare3), what's the rough timeline for the Flyte 1.4 release?
  • Dan Rammer (hamersaw) (03/21/2023, 7:50 PM)
    So 1.4 is the current stable; 1.5 will be out at the end of the month (maybe the first week of April), since we switched to a monthly release cycle. I opened this PR to get the plugin updates merged into propeller and will make sure it is merged for the 1.5 release.
  • Keshi Dai (03/21/2023, 8:14 PM)
    Thanks @Dan Rammer (hamersaw)!