# ask-the-community
g
Hi, could you please advise about the workings of webapi.ResourceQuotas (https://docs.flyte.org/en/latest/deployment/configuration/generated/flytepropeller_config.html#resourcequotas-webapi-resourcequotas)? What happens when this limit is hit? Is this an enforcing limit if configured in FlytePropeller for a FlytePlugin?
d
It seems to be part of the plugin config, yes (see). As to how exactly Propeller handles the situation when the limits are hit, I'm not sure. Maybe @Dan Rammer (hamersaw) knows best?
d
cc @Kevin Su you're intimately familiar with the web API resource quota from the agents work, right? If not, I can dive through the code and figure this out.
k
@GF yup, Propeller won’t submit a new job if the limits are reached.
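Very roughly, it behaves like a counting semaphore sized by resourceQuotas.default; a toy sketch of the idea (made-up code, not the actual Propeller implementation):
Copy code
package main

import "fmt"

func main() {
    // Toy model of resourceQuotas.default = 2: a buffered channel used as a
    // counting semaphore. While it is full, no new remote job is submitted.
    quota := make(chan struct{}, 2)

    submit := func(task string) {
        select {
        case quota <- struct{}{}:
            fmt.Println(task, "submitted to the downstream system")
        default:
            fmt.Println(task, "held back - quota exhausted, stays queued")
        }
    }

    submit("task-a") // submitted
    submit("task-b") // submitted
    submit("task-c") // held back

    <-quota          // task-a finishes, its slot is released
    submit("task-c") // now it goes through
}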
g
Thank you for the feedback. Interestingly, I set the value to 1, and with 4 independent tasks I see them all dispatched towards the downstream system at the same time.
Could this be overridden somehow?
k
did you restart the propeller after you changed the config?
also, could you share your propeller config, especially the webApi config?
h
@GF are you running propeller manager (multiple propellers)? Without connecting this to Redis, each propeller will have its own view of consumed resources in memory, and you will end up with a situation where multiple tasks will run. Can you do this for me to check that propeller got the right config applied:
Copy code
kubectl port-forward -n flyte deploy/flytepropeller 10254
Then go to the browser and visit http://localhost:10254/config. You should find "resourceQuotas" with the values you specified. Mind sending a screenshot of that?
r
Hi @Kevin Su @Haytham Abuelfutuh, @GF is off this week, so let me jump in and follow up on this thread. Today I'll reproduce the issue and share all the details.
This is the full config for the Databricks Flyte plugin:
Copy code
databricks:
  enabled: true
  upload_entrypoint: true
  plugin_config:
    plugins:
      databricks:
        databricksInstance: <!--- set to our DBX instance --->
        # this is the entrypoint.py for flyte on databricks
        entrypointFile: <!--- set to our DBX entrypoint file location --->
        # this file is mounted by vault agent injector at /vault/secrets
        databricksTokenKey: <!--- set to our DBX token key --->
        webApi:
          caching:
            maxSystemFailures: 5
            resyncInterval: 60s #default value is 30s!
            size: 500000
            workers: 10
          readRateLimiter:
            burst: 20 #default value is 100!
            qps: 10
          resourceMeta: null
          resourceQuotas:
            default: 10 #default value is 1000!
          writeRateLimiter:
            burst: 20 #default value is 100!
            qps: 10
Our observations with the above settings:
• Flyte is hitting the Databricks API rate limit - the readRateLimiter settings are not applied
• Flyte is ignoring the resourceQuotas / default value - we set it to 10 and 25 Spark tasks are running simultaneously (that's the limit set by flyteadmin / maxParallelism)
image.png
It feels like we are missing some component that applies the webApi config for the Flyte Databricks plugin. We have come across the Flyte ResourceManager page; currently we have this Propeller config:
Copy code
propeller:
  resourcemanager:
    type: noop
@Kevin Su Can you please clarify whether we need to set up a ResourceManager to apply the settings in the webApi config?
@Haytham Abuelfutuh we have a single propeller; here is the requested screenshot:
image.png
a
@Kevin Su @Haytham Abuelfutuh Kindly let us know if you need more details. Our engineering team is stuck and would appreciate any suggestions. cc: @Ketan (kumare3)
We are running our workflows in production, and it is critical for us to resolve these resource constraints.
k
@anantharaman janakiraman as you know, we are an open-source community - we are happy to help, but we cannot provide SLAs. Shall we work on a support plan to provide SLAs? Also, I am in meetings all day and Haytham is the CTO at Union. Please help us prioritize.
a
sure @Ketan (kumare3), totally understand. Sorry that we have to bother your team with this issue. We tried a few things before reaching out to the community but couldn't get to a resolution.
Thanks again for helping us out!
k
We will absolutely help. Thank you for the patience
a
sharing the Go source for the Databricks plugin config just for context: https://github.com/flyteorg/flyte/blob/aedde593828d95242df2df93249f897b3ac05d2c/flyteplugins/go/tasks/plugins/webapi/databricks/config.go#L19[…]9C13 - Robert provided the config that was used in our deployment above.
h
@anantharaman janakiraman @Robert Ambrus ah, thank you for sharing the propeller config. At the moment the only two implementations of the resource manager are Noop (always allow) and Redis (requires a Redis connection). I mistakenly thought there was an in-memory implementation, but it turns out there isn't. I think the best course of action for your team here is to stand up a Redis instance (or AWS MemCache if on AWS) and configure flytepropeller to use that as the backing storage for the resource manager. Something like this:
Copy code
propeller:
  resourcemanager:
    type: redis
    redis:
      hostPaths:
        - <redis replica 1>...
      hostKey: <password>
      maxRetries: 3
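For context on why Redis helps here: as far as I understand, the Redis resource manager keeps one shared set of allocation tokens, so every propeller sees the same count of in-flight jobs. A simplified sketch of that idea using go-redis (illustrative only, not the exact Flyte implementation; it also ignores the check-then-set race the real code has to handle):
Copy code
package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// allocate tries to claim one quota slot for token under key; it returns
// false when the shared quota (limit) is already used up.
func allocate(ctx context.Context, rdb *redis.Client, key, token string, limit int64) (bool, error) {
    used, err := rdb.SCard(ctx, key).Result()
    if err != nil {
        return false, err
    }
    if used >= limit {
        return false, nil // caller keeps the task queued
    }
    return true, rdb.SAdd(ctx, key, token).Err()
}

// release frees the slot once the remote job has finished.
func release(ctx context.Context, rdb *redis.Client, key, token string) error {
    return rdb.SRem(ctx, key, token).Err()
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    ok, err := allocate(ctx, rdb, "quota:databricks:default", "exec-abc/n0", 10)
    fmt.Println("allocated:", ok, "err:", err)
    _ = release(ctx, rdb, "quota:databricks:default", "exec-abc/n0")
}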
r
@Haytham Abuelfutuh I see, thanks for the confirmation! Let us set up a resource manager and see if it solves the problem. Do we need to make any other changes to apply the webApi config?
@Haytham Abuelfutuh Are there any special requirements for Flyte plugins to work with the ResourceManager? Our current setup relies on the legacy Databricks plugin.
h
I'm looking through your config, I believe you have all the pieces for this to work... once a redis resource manager is initialized, the plugin will automatically start using it.
r
All right, thank you for your help! We'll set up the Redis resource manager and get back to you with the results.
a
Thanks @Haytham Abuelfutuh !
our team will try it out and will update
h
Awesome! wish you all the best... we are here to support you... and I appreciate your understanding of how busy everyone is...
k
@anantharaman janakiraman did that work?
a
Robert is working on it @Ketan (kumare3). I will check with Robert and confirm
r
Hi @Ketan (kumare3) @Haytham Abuelfutuh @anantharaman janakiraman
We have successfully set up a Redis instance and connected it to Flyte. We tried to run a dynamic workflow with 99 tasks and did not override the default max_parallelism value (25).
OBSERVATIONS
1. databricks / resourceQuotas is applied successfully
◦ 10 tasks in RUNNING state - launched (resourceQuotas)
◦ 15 tasks in RUNNING state - queued (max_parallelism - resourceQuotas)
◦ all the remaining tasks in UNKNOWN state
◦ the 10 launched tasks succeeded
◦ 10 more tasks moved to RUNNING state - queued
◦ unfortunately, the workflow is stuck in this phase; it seems that once a task enters the queued phase, it cannot move to the launched phase anymore
2. databricks / webApi / readRateLimiter is not applied
◦ we still see in the logs that more than a hundred requests per second are sent to the downstream API, even though we set QPS = 10, BURST = 20
So it seems that the Redis - Flyte integration was done successfully, but we still face functional issues. Although both issues are important, the second one is critical. Can we focus on that one?
QUESTIONS
• Is the webApi / readRateLimiter config supposed to be applied by the Redis ResourceManager?
• Do we need any other configurations besides the ones we already shared?
Our Resource Manager config:
Copy code
resourcemanager:
  resourceMaxQuota: 1000
  redis:
    hostKey: *****
    hostPaths:
      - *****
    maxRetries: 3
  type: redis
g
Hi @Haytham Abuelfutuh, could you please advise on the above? Thank you in advance.
r
cc @Aarthi Vellingiri
k
So the problems are the read rate limiter and the stuck workflow?
r
yes
k
Cc @Kevin Su / @Eduardo Apolinario (eapolinario) can you please help here?
@Robert Ambrus and team, can we have a call? We need to understand this correctly.
r
yeah, let us discuss internally when everyone is available
k
Also we are not available till noon
r
what timezone?
k
PST
r
In the meantime, can you please clarify which component is expected to apply the webApi / readRateLimiter config? That's the most burning issue for us. It seems these configs are ignored for the Databricks plugin. Is it supposed to be applied by the Flyte ResourceManager?
Sorry, we can't make a call today; we're in the EU timezone, so that's late at night for us. Our top priority is the webApi / readRateLimiter config. If you could clarify which component (e.g. the Flyte plugin, Redis) is responsible for applying this config, that would be very helpful.
k
@Robert Ambrus what version of flytepropeller are you running?
I think this was a bug that was already patched some time ago
g
propeller version is v1.10.6
a
@Ketan (kumare3) I can talk later today if you or anyone else is available, and relay back any suggestions to Robert and Gabor.
Do you have time later today or early tomorrow?
k
I think I am confused between the 2 threads, sorry.
I guess we need to jump on a call together to understand what you folks are seeing.
@anantharaman janakiraman when would you have some time?
a
@Ketan (kumare3) do you have time now 🙂
I know it is a little too late, but just checking; if not, we can connect tomorrow when you have time.
k
hey @anantharaman janakiraman if we want to meet in the morning maybe 10:30 am should work for me
I have 30 minutes
a
that works for me
let's talk in the morning
@Robert Ambrus and @GF Based on my conversation with Ketan, where I provided the context around the issue, the Flyte team is going to investigate and may get back to us with a possible solution for the config in a day. @Ketan (kumare3) Please also look at Robert's message above about enabling the Redis resource manager - it didn't help in applying the rate limiter config.
r
Thank you @anantharaman janakiraman and @Ketan (kumare3) Can you please confirm that the Redis resource manager is expected to apply the rate limiter configuration?
I had a quick look at the Flyte codebase and could not find any usage of the ReadRateLimiter config - maybe I just missed it.
a
@Robert Ambrus Ketan was mentioning that he will have an internal discussion some time in the morning today and that includes confirming how the rate limiter config is applied. I will circle back with Ketan around Noon today and share the updates with you
k
Also, the rate limiter is just in memory.
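i.e. the read/write rate limiters are token buckets inside each propeller process and are not shared through Redis. Conceptually, the qps/burst settings map to something like this (a sketch using golang.org/x/time/rate, not the exact plugin code):
Copy code
package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // qps=10, burst=20 mirrors the readRateLimiter values from the config above.
    limiter := rate.NewLimiter(rate.Limit(10), 20)

    ctx := context.Background()
    start := time.Now()
    for i := 0; i < 40; i++ {
        // Wait blocks until the token bucket allows another downstream GET call.
        if err := limiter.Wait(ctx); err != nil {
            panic(err)
        }
    }
    // 40 calls with burst=20 and qps=10 should take roughly 2 seconds.
    fmt.Println("elapsed:", time.Since(start))
}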
h
@anantharaman janakiraman @Robert Ambrus I'm looking at the resource quota issue now, will keep you updated!
I just ran a quick test to validate that resource quotas are respected and I do see they are (screenshot)...
Copy code
webApi:
  caching:
    maxSystemFailures: 5
    resyncInterval: 60s #default value is 30s!
    size: 500000
    workers: 10
  readRateLimiter:
    burst: 20 #default value is 100!
    qps: 10
  resourceMeta: null
  resourceQuotas:
    default: 1 #default value is 1000!
  writeRateLimiter:
    burst: 20 #default value is 100!
    qps: 10
This is my relevant config... I set default to 1 just to make sure I run out of quota quickly... Trying to see if there is an issue with freeing up tokens that might cause this... In the meantime, do you mind enabling INFO logs on propeller and looking for the following lines:
Copy code
Start building a resource manager
to the Redis Qubole set
Too many allocations
@anantharaman janakiraman @Robert Ambrus
a
@Haytham Abuelfutuh is this for the databricks webApi config?
What resource constraint are you applying here in the sample config, just for context?
And the concern was also about the rate limiter.
h
Copy code
resourceQuotas:
  default: 1 #default value is 1000!
I didn't set any resource constraints
a
I guess my question is what plugin are you testing? I am not sure what this resourceQuota is for
oh ok based on the screenshot it looks like you are trying to run something against BigQuery
I see
@Haytham Abuelfutuh just for clarity and to understand your test correctly: this would have executed AllocateToken() for the BigQuery job, and upon completion Propeller would have executed ReleaseToken(), after which the next job can execute. So it will basically wait for the one BigQuery job to complete before executing the next job in the queue. If I had multiple tasks running in parallel, this would have caused just one task to execute while the other task would be waiting for the resource to be released, right?
h
That's absolutely correct
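To make that flow concrete, here is a toy version of the allocate → launch → release ordering (illustrative names only, not the real Flyte signatures):
Copy code
package main

import (
    "fmt"
    "sync"
)

// quota is a stand-in for the resource manager: at most `limit` tokens can be
// held at the same time.
type quota struct {
    mu    sync.Mutex
    limit int
    used  int
}

func (q *quota) AllocateToken() bool {
    q.mu.Lock()
    defer q.mu.Unlock()
    if q.used >= q.limit {
        return false // task stays queued
    }
    q.used++
    return true
}

func (q *quota) ReleaseToken() {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.used--
}

func main() {
    q := &quota{limit: 1} // resourceQuotas default: 1

    // Two tasks arrive at the same time; only one gets a token.
    fmt.Println("job-1 allocated:", q.AllocateToken()) // true  -> launched
    fmt.Println("job-2 allocated:", q.AllocateToken()) // false -> stays queued

    // job-1 completes and releases its token...
    q.ReleaseToken()

    // ...and only then can job-2 move from queued to launched.
    fmt.Println("job-2 allocated:", q.AllocateToken()) // true
}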
a
ok great. So @Robert Ambrus, can we please update the log level and then try to run a workflow with parallel tasks to check if we see any token allocation/release issues for the Databricks plugin? I can even create a simple workflow to check things quickly if that helps. Please let me know.
g
Hi, our primary concern is the rate limiting of the GET calls towards the downstream system behind the webApi, since these are the calls that are throttling the downstream system.
{"json":{"routine":"databricks-worker-1","src":"plugin.go:164"},"level":"debug","msg":"Get databricks job response%!(EXTRA string=resp, *http.Response=\u0026{429 Too Many Requests 429 HTTP/2.0 2 0 map[Date:[Wed, 03 Apr 2024 07:37:26 GMT] Retry-After:[1] Server:[databricks] ... [Maximum rate of 100 requests per SECOND has been exceeded. Please reduce the rate of requests and try again after 1 second(s)]] {} 0 [] false false map[] ...})","ts":"2024-04-03T07:37:26Z"}
a
We are talking about multiple things here. @GF can we please update the logging level and look at the logs to see if we can identify any issue with the resource quotas getting enforced? That should basically help control the number of requests going to Databricks through the plugin. @Haytham Abuelfutuh how does the readRateLimiter config get enforced?
Or @GF are we suggesting that the resource quotas are already being respected, but the rate limiter config is not getting enforced?
r
Hi @Ketan (kumare3) @Haytham Abuelfutuh Regarding the rate limiter config... I looked into the codebase that is responsible for syncing the DBX job statuses. Let me explain my understanding:
• the cache.go module is responsible for syncing Flyte node statuses (this is what I see from the logs)
• I assume this module should enforce the rate limiting
• I had a look at the cache.go init and found this:
Copy code
autoRefreshCache, err := cache.NewAutoRefreshCache(name, q.SyncResource,
		workqueue.DefaultControllerRateLimiter(), cfg.ResyncInterval.Duration, cfg.Workers, cfg.Size,
		scope.NewSubScope("cache"))
The ResyncInterval, Workers and Size configs are respected (that's working in our setup as well), but I can't see any utilization of the webapi ratelimiter configs; workqueue.DefaultControllerRateLimiter() uses hard-coded values (qps: 10, burst: 100). Can you please confirm that my understanding is correct and that the ratelimiter configs should be applied here?
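For reference, workqueue.DefaultControllerRateLimiter() combines a per-item exponential backoff with a fixed overall bucket (qps 10 / burst 100), so the webApi values never reach it. A config-driven variant would presumably look roughly like this (just a sketch of the idea, not necessarily what the actual fix does):
Copy code
package main

import (
    "fmt"
    "time"

    "golang.org/x/time/rate"
    "k8s.io/client-go/util/workqueue"
)

// newRateLimiter builds a workqueue rate limiter whose overall token bucket
// comes from the webApi readRateLimiter config instead of the hard-coded
// qps=10 / burst=100 used by workqueue.DefaultControllerRateLimiter().
func newRateLimiter(qps float64, burst int) workqueue.RateLimiter {
    return workqueue.NewMaxOfRateLimiter(
        // keep the per-item exponential backoff of the default limiter
        workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
        // but make the global bucket configurable
        &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(qps), burst)},
    )
}

func main() {
    rl := newRateLimiter(10, 20) // values from the databricks webApi config above
    fmt.Printf("configured limiter: %T\n", rl)
}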
k
@Robert Ambrus are you around now?
This is impossible to debug like this, as we do not have a Databricks account and cannot really test that high a volume (too expensive).
For the rate limiter you might be right.
@Robert Ambrus is the resource quota working?
a
Hey @Ketan (kumare3), I can chat with you. I synced with Robert and Gabor some time back.
Do you have a few minutes to chat?
k
I have a meeting at 10
sent you an invite
On the other hand I do believe the rate limiter config is not obeyed - I created a PR https://github.com/flyteorg/flyte/pull/5190/files (is this your only problem?)
if it is just the ratelimiter config, then ^ this should fix it
a
@Ketan (kumare3) the rate limiter fix that you propose above should potentially fix the second problem that Robert listed. The first problem still needs to be resolved in the plugin: the resource quotas are respected for the first set of executions, but the subsequent set of executions goes into a queued state forever and never gets into a running state.
h
oh... is that what you see? I thought it wasn't being used at all...
looking more into that..
Can you check the logs for "Attempting to finalize resource"? There is also a metric, .resource_release_failed, that tracks failures to release resources. Can you check for that too?
a
sure, we will check and let you know, but do we know of any potential reason for this to happen? Also, I was under the impression that the jobs would be released as and when a task completes and its resources are released, but it looks like it is sending them in batches, basically waiting for all the jobs to complete (within the configured resource quota limit) before releasing the next set of jobs.
k
@anantharaman janakiraman / @GF / @Robert Ambrus I think we found another small bug and fixed it https://github.com/flyteorg/flyte/pull/5195 Please deploy the new patches
g
Hi @Ketan (kumare3), thank you for providing the above PRs. We have done some initial testing of these and it goes as follows:
• https://github.com/flyteorg/flyte/pull/5195 looks like it is working now - the resourceQuota is enforced on job submits
• https://github.com/flyteorg/flyte/pull/5190 is still not enforcing the rates for the GET calls towards the downstream system. Could you please advise on this?
cc @anantharaman janakiraman @Robert Ambrus
k
I am out today, but can you please file a ticket? Hopefully the biggest problem is solved. Reduce the number of parallel jobs for now? Cc @Kevin Su this PR worked, right?
k
looking
r
@Ketan (kumare3) @Kevin Su ticket raised: https://github.com/flyteorg/flyte/issues/5202 cc @anantharaman janakiraman @Aarthi Vellingiri @GF