# ask-the-community
g
Hi, could you please advise about the workings of webapi.ResourceQuotas (https://docs.flyte.org/en/latest/deployment/configuration/generated/flytepropeller_config.html#resourcequotas-webapi-resourcequotas)? What happens when this limit is hit? Is this an enforcing limit if configured in FlytePropeller for a FlytePlugin?
d
It seems to be part of the plugin config, yes (see). As to how exactly Propeller handles the situation when the limits are hit, I'm not sure. Maybe @Dan Rammer (hamersaw) knows best?
d
cc @Kevin Su you're intimately familiar with the web API resource quota from the agents work, right? If not, I can dive through the code and figure this out.
k
@GF yup, Propeller won’t submit a new job if the limits are reached.
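Very roughly, it behaves like a counting semaphore sized by resourceQuotas.default; a toy sketch of the idea (made-up code, not the actual Propeller implementation):
Copy code
package main

import "fmt"

func main() {
    // Toy model of resourceQuotas.default = 2: a buffered channel used as a
    // counting semaphore. While it is full, no new remote job is submitted.
    quota := make(chan struct{}, 2)

    submit := func(task string) {
        select {
        case quota <- struct{}{}:
            fmt.Println(task, "submitted to the downstream system")
        default:
            fmt.Println(task, "held back - quota exhausted, stays queued")
        }
    }

    submit("task-a") // submitted
    submit("task-b") // submitted
    submit("task-c") // held back

    <-quota          // task-a finishes, its slot is released
    submit("task-c") // now it goes through
}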
g
Thank you for the feedback. Interestingly, I set the value to 1, and with 4 independent tasks I see them all dispatched towards the downstream system at the same time.
Could this be overridden somehow?
k
did you restart the propeller after you changed the config?
also, could you share your propeller config, especially the webApi config?
h
@GF are you running propeller manager (multiple propellers)? Without connecting this to Redis, each propeller will have its own view of consumed resources in memory, and you will end up with a situation where multiple tasks will run. Can you do this for me to check that propeller got the right config applied:
Copy code
kubectl port-forward -n flyte deploy/flytepropeller 10254
Then go to the browser and visit http://localhost:10254/config. You should find "resourceQuotas" with the values you specified. Mind sending a screenshot of that?
r
Hi @Kevin Su @Haytham Abuelfutuh, @GF is off this week, so let me jump in and follow up on this thread. Today I'll reproduce the issue and share all the details.
This is the full config for the Databricks Flyte plugin:
Copy code
databricks:
  enabled: true
  upload_entrypoint: true
  plugin_config:
    plugins:
      databricks:
        databricksInstance: <!--- set to our DBX instance --->
        # this is the entrypoint.py for flyte on databricks
        entrypointFile: <!--- set to our DBX entrypoint file location --->
        # this file is mounted by vault agent injector at /vault/secrets
        databricksTokenKey: <!--- set to our DBX token key --->
        webApi:
          caching:
            maxSystemFailures: 5
            resyncInterval: 60s #default value is 30s!
            size: 500000
            workers: 10
          readRateLimiter:
            burst: 20 #default value is 100!
            qps: 10
          resourceMeta: null
          resourceQuotas:
            default: 10 #default value is 1000!
          writeRateLimiter:
            burst: 20 #default value is 100!
            qps: 10
Our observations with the above settings:
• Flyte is hitting the Databricks API rate limit - the readRateLimiter settings are not applied
• Flyte is ignoring the resourceQuotas / default value - we set it to 10 and 25 Spark tasks are running simultaneously (that's the limit set by flyteadmin / maxParallelism)
image.png
It feels like we are missing some component that applies the webApi config for the Flyte Databricks plugin. We have come across the Flyte ResourceManager page; currently we have this Propeller config:
Copy code
propeller:
  resourcemanager:
    type: noop
@Kevin Su Can you please clarify whether we need to set up a ResourceManager to apply the settings in the webApi config?
@Haytham Abuelfutuh we have a single propeller; here is the requested screenshot:
image.png
a
@Kevin Su @Haytham Abuelfutuh Kindly let us know if you need more details. Our engineering team is stuck and would appreciate any suggestions. cc: @Ketan (kumare3)
We are running our workflows in production, and it is critical for us to resolve these resource constraints.
k
@anantharaman janakiraman as you know, we are an open-source community - we are happy to help, but we cannot provide SLAs. Shall we work on a support plan to provide SLAs? Also, I am in meetings all day and Haytham is the CTO at Union. Please help us prioritize.
a
sure @Ketan (kumare3), totally understand. Sorry that we have to bother your team with this issue. We tried a few things before reaching out to the community but couldn't get to a resolution.
Thanks again for helping us out!
k
We will absolutely help. Thank you for the patience
a
sharing the Go source for the Databricks plugin config just for context: https://github.com/flyteorg/flyte/blob/aedde593828d95242df2df93249f897b3ac05d2c/flyteplugins/go/tasks/plugins/webapi/databricks/config.go#L19[…]9C13 - Robert provided the config that was used in our deployment above.
h
@anantharaman janakiraman @Robert Ambrus ah, thank you for sharing the propeller config. At the moment the only two implementations of the resource manager are Noop (always allow) and Redis (requires a Redis connection). I mistakenly thought there was an in-memory implementation, but it turns out there isn't. I think the best course of action for your team here is to stand up a Redis instance (or AWS MemCache if on AWS) and configure flytepropeller to use that as the backing storage for the resource manager. Something like this:
Copy code
propeller:
  resourcemanager:
    type: redis
    redis:
      hostPaths:
        - <redis replica 1>...
      hostKey: <password>
      maxRetries: 3
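For context on why Redis helps here: as far as I understand, the Redis resource manager keeps one shared set of allocation tokens, so every propeller sees the same count of in-flight jobs. A simplified sketch of that idea using go-redis (illustrative only, not the exact Flyte implementation; it also ignores the check-then-set race the real code has to handle):
Copy code
package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// allocate tries to claim one quota slot for token under key; it returns
// false when the shared quota (limit) is already used up.
func allocate(ctx context.Context, rdb *redis.Client, key, token string, limit int64) (bool, error) {
    used, err := rdb.SCard(ctx, key).Result()
    if err != nil {
        return false, err
    }
    if used >= limit {
        return false, nil // caller keeps the task queued
    }
    return true, rdb.SAdd(ctx, key, token).Err()
}

// release frees the slot once the remote job has finished.
func release(ctx context.Context, rdb *redis.Client, key, token string) error {
    return rdb.SRem(ctx, key, token).Err()
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    ok, err := allocate(ctx, rdb, "quota:databricks:default", "exec-abc/n0", 10)
    fmt.Println("allocated:", ok, "err:", err)
    _ = release(ctx, rdb, "quota:databricks:default", "exec-abc/n0")
}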
r
@Haytham Abuelfutuh I see, thanks for the confirmation! Let us set up a resource manager and see if it solves the problem. Do we need to make any other changes to apply the webApi config?
@Haytham Abuelfutuh Are there any special requirements for Flyte plugins to work with the ResourceManager? Our current setup relies on the legacy Databricks plugin.
h
I'm looking through your config, I believe you have all the pieces for this to work... once a redis resource manager is initialized, the plugin will automatically start using it.
r
All right, thank you for your help! We'll set up the Redis resource manager and get back to you with the results.
a
Thanks @Haytham Abuelfutuh !
our team will try it out and will update
h
Awesome! wish you all the best... we are here to support you... and I appreciate your understanding of how busy everyone is...
k
@anantharaman janakiraman did that work?
a
Robert is working on it @Ketan (kumare3). I will check with Robert and confirm
r
Hi @Ketan (kumare3) @Haytham Abuelfutuh @anantharaman janakiraman
We have successfully set up a Redis instance and connected it to Flyte. We tried to run a dynamic workflow with 99 tasks and did not override the default max_parallelism value (25).
OBSERVATIONS
1. databricks / resourceQuotas is applied successfully
◦ 10 tasks in RUNNING state - launched (resourceQuotas)
◦ 15 tasks in RUNNING state - queued (max_parallelism - resourceQuotas)
◦ all the remaining tasks in UNKNOWN state
◦ the 10 launched tasks succeeded
◦ 10 more tasks moved to RUNNING state - queued
◦ unfortunately, the workflow is stuck in this phase; it seems that once a task enters the queued phase, it cannot move to the launched phase anymore
2. databricks / webApi / readRateLimiter is not applied
◦ we still see in the logs that more than a hundred requests per second are sent to the downstream API, even though we set QPS = 10, BURST = 20
So it seems that the Redis - Flyte integration was done successfully, but we still face functional issues. Although both issues are important, the second one is critical. Can we focus on that one?
QUESTIONS
• Is the webApi / readRateLimiter config supposed to be applied by the Redis ResourceManager?
• Do we need any other configurations besides the ones we already shared?
Our Resource Manager config:
Copy code
resourcemanager:
  resourceMaxQuota: 1000
  redis:
    hostKey: *****
    hostPaths:
      - *****
    maxRetries: 3
  type: redis
g
Hi @Haytham Abuelfutuh, could you please advise on the above? Thank you in advance.
r
cc @Aarthi Vellingiri
k
So the problems are the read rate limiter and the stuck workflow?
r
yes
k
Cc @Kevin Su / @Eduardo Apolinario (eapolinario) can you please help here?
@Robert Ambrus and team, can we have a call? We need to understand this correctly.
r
yeah, let us discuss internally when everyone is available
k
Also we are not available till noon
r
what timezone?
k
PST
r
In the meantime, can you please clarify which component is expected to apply the webApi / readRateLimiter config? That's the most burning issue for us. It seems these configs are ignored for the Databricks plugin. Is it supposed to be applied by the Flyte ResourceManager?
Sorry, we can't make a call today; we're in the EU timezone, so that's late at night for us. Our top priority is the webApi / readRateLimiter config. If you could clarify which component (e.g. the Flyte plugin, Redis) is responsible for applying this config, that would be very helpful.
k
@Robert Ambrus what version of flytepropeller are you running?
I think this was a bug that was already patched some time ago
g
propeller version is v1.10.6
a
@Ketan (kumare3) I can talk later today if you or anyone else is available, and relay back any suggestions to Robert and Gabor.
Do you have time later today or early tomorrow?
k
I think I am confused between the 2 threads, sorry.
I guess we need to jump on a call together to understand what you folks are seeing.
@anantharaman janakiraman when would you have some time?
a
@Ketan (kumare3) do you have time now 🙂
I know it is a little too late, but just checking; if not, we can connect tomorrow when you have time.
k
hey @anantharaman janakiraman if we want to meet in the morning maybe 10:30 am should work for me
I have 30 minutes
a
that works for me
let's talk in the morning
@Robert Ambrus and @GF Based on my conversation with Ketan, where I provided the context around the issue, the Flyte team is going to investigate and may get back to us with a possible solution for the config in a day. @Ketan (kumare3) Please also look at Robert's message above about enabling the Redis resource manager - it didn't help in applying the rate limiter config.
r
Thank you @anantharaman janakiraman and @Ketan (kumare3) Can you please confirm that the Redis resource manager is expected to apply the rate limiter configuration?
I had a quick look at the Flyte codebase and could not find any usage of the ReadRateLimiter config - maybe I just missed it.
a
@Robert Ambrus Ketan was mentioning that he will have an internal discussion some time in the morning today and that includes confirming how the rate limiter config is applied. I will circle back with Ketan around Noon today and share the updates with you
k
Also, the rate limiter is just in memory.
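i.e. the read/write rate limiters are token buckets inside each propeller process and are not shared through Redis. Conceptually, the qps/burst settings map to something like this (a sketch using golang.org/x/time/rate, not the exact plugin code):
Copy code
package main

import (
    "context"
    "fmt"
    "time"

    "golang.org/x/time/rate"
)

func main() {
    // qps=10, burst=20 mirrors the readRateLimiter values from the config above.
    limiter := rate.NewLimiter(rate.Limit(10), 20)

    ctx := context.Background()
    start := time.Now()
    for i := 0; i < 40; i++ {
        // Wait blocks until the token bucket allows another downstream GET call.
        if err := limiter.Wait(ctx); err != nil {
            panic(err)
        }
    }
    // 40 calls with burst=20 and qps=10 should take roughly 2 seconds.
    fmt.Println("elapsed:", time.Since(start))
}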
h
@anantharaman janakiraman @Robert Ambrus I'm looking at the resource quota issue now, will keep you updated!
I just ran a quick test to validate that resource quotas are respected and I do see they are (screenshot)...
Copy code
webApi:
  caching:
    maxSystemFailures: 5
    resyncInterval: 60s #default value is 30s!
    size: 500000
    workers: 10
  readRateLimiter:
    burst: 20 #default value is 100!
    qps: 10
  resourceMeta: null
  resourceQuotas:
    default: 1 #default value is 1000!
  writeRateLimiter:
    burst: 20 #default value is 100!
    qps: 10
This is my relevant config... I set default to 1 just to make sure I run out of quota quickly... Trying to see if there is an issue with freeing up tokens that might cause this... In the meantime, do you mind enabling INFO logs on propeller and looking for the following lines:
Copy code
Start building a resource manager
to the Redis Qubole set
Too many allocations
@anantharaman janakiraman @Robert Ambrus
a
@Haytham Abuelfutuh is this for the databricks webApi config?
What resource constraint are you applying here in the sample config, just for context?
And the concern was also about the rate limiter.
h
Copy code
resourceQuotas:
  default: 1 #default value is 1000!
I didn't set any resource constraints
a
I guess my question is what plugin are you testing? I am not sure what this resourceQuota is for
oh ok based on the screenshot it looks like you are trying to run something against BigQuery
I see
@Haytham Abuelfutuh just for clarity and to understand your test correctly: this would have executed AllocateToken() for the BigQuery job, and upon completion Propeller would have executed ReleaseToken(), after which the next job can execute. So it will basically wait for the one BigQuery job to complete before executing the next job in the queue. If I had multiple tasks running in parallel, this would have caused just one task to execute while the other task would be waiting for the resource to be released, right?
h
That's absolutely correct
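To make that flow concrete, here is a toy version of the allocate → launch → release ordering (illustrative names only, not the real Flyte signatures):
Copy code
package main

import (
    "fmt"
    "sync"
)

// quota is a stand-in for the resource manager: at most `limit` tokens can be
// held at the same time.
type quota struct {
    mu    sync.Mutex
    limit int
    used  int
}

func (q *quota) AllocateToken() bool {
    q.mu.Lock()
    defer q.mu.Unlock()
    if q.used >= q.limit {
        return false // task stays queued
    }
    q.used++
    return true
}

func (q *quota) ReleaseToken() {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.used--
}

func main() {
    q := &quota{limit: 1} // resourceQuotas default: 1

    // Two tasks arrive at the same time; only one gets a token.
    fmt.Println("job-1 allocated:", q.AllocateToken()) // true  -> launched
    fmt.Println("job-2 allocated:", q.AllocateToken()) // false -> stays queued

    // job-1 completes and releases its token...
    q.ReleaseToken()

    // ...and only then can job-2 move from queued to launched.
    fmt.Println("job-2 allocated:", q.AllocateToken()) // true
}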
a
ok great. So @Robert Ambrus, can we please update the log level and then try to run a workflow with parallel tasks to check if we see any token allocation/release issues for the Databricks plugin? I can even create a simple workflow to check things quickly if that helps. Please let me know.
g
Hi, our primary concern is the rate limiting of the GET calls towards the downstream system behind the webApi, since these are the calls that are throttling the downstream system.
{"json":{"routine":"databricks-worker-1","src":"plugin.go:164"},"level":"debug","msg":"Get databricks job response%!(EXTRA string=resp, *http.Response=\u0026{429 Too Many Requests 429 HTTP/2.0 2 0 map[Date:[Wed, 03 Apr 2024 07:37:26 GMT] Retry-After:[1] Server:[databricks] ... [Maximum rate of 100 requests per SECOND has been exceeded. Please reduce the rate of requests and try again after 1 second(s)]] {} 0 [] false false map[] ...})","ts":"2024-04-03T07:37:26Z"}
a
We are talking about multiple things here. @GF can we please update the logging level and look at the logs to see if we can identify any issue with the resource quotas getting enforced? That should basically help control the number of requests going to Databricks through the plugin. @Haytham Abuelfutuh how does the readRateLimiter config get enforced?
Or @GF are we suggesting that the resource quotas are already being respected, but the rate limiter config is not getting enforced?
r
Hi @Ketan (kumare3) @Haytham Abuelfutuh Regarding the rate limiter config... I looked into the codebase that is responsible for syncing the DBX job statuses. Let me explain my understanding:
• the cache.go module is responsible for syncing Flyte node statuses (this is what I see from the logs)
• I assume this module should enforce the rate limiting
• I had a look at the cache.go init and found this:
Copy code
autoRefreshCache, err := cache.NewAutoRefreshCache(name, q.SyncResource,
		workqueue.DefaultControllerRateLimiter(), cfg.ResyncInterval.Duration, cfg.Workers, cfg.Size,
		scope.NewSubScope("cache"))
The ResyncInterval, Workers and Size configs are respected (that's working in our setup as well), but I can't see any utilization of the webapi ratelimiter configs; workqueue.DefaultControllerRateLimiter() uses hard-coded values (qps: 10, burst: 100). Can you please confirm that my understanding is correct and that the ratelimiter configs should be applied here?
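For reference, workqueue.DefaultControllerRateLimiter() combines a per-item exponential backoff with a fixed overall bucket (qps 10 / burst 100), so the webApi values never reach it. A config-driven variant would presumably look roughly like this (just a sketch of the idea, not necessarily what the actual fix does):
Copy code
package main

import (
    "fmt"
    "time"

    "golang.org/x/time/rate"
    "k8s.io/client-go/util/workqueue"
)

// newRateLimiter builds a workqueue rate limiter whose overall token bucket
// comes from the webApi readRateLimiter config instead of the hard-coded
// qps=10 / burst=100 used by workqueue.DefaultControllerRateLimiter().
func newRateLimiter(qps float64, burst int) workqueue.RateLimiter {
    return workqueue.NewMaxOfRateLimiter(
        // keep the per-item exponential backoff of the default limiter
        workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
        // but make the global bucket configurable
        &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(qps), burst)},
    )
}

func main() {
    rl := newRateLimiter(10, 20) // values from the databricks webApi config above
    fmt.Printf("configured limiter: %T\n", rl)
}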
k
@Robert Ambrus are you around now?
This is impossible to debug like this, as we do not have a Databricks account and cannot really test that high a volume (too expensive).
For the rate limiter you might be right.
@Robert Ambrus is the resource quota working?
a
Hey @Ketan (kumare3), I can chat with you. I synced with Robert and Gabor some time back.
Do you have a few minutes to chat?
k
I have a meeting at 10
sent you an invite
On the other hand I do believe the rate limiter config is not obeyed - I created a PR https://github.com/flyteorg/flyte/pull/5190/files (is this your only problem?)
if it is just the ratelimiter config, then ^ this should fix it
a
@Ketan (kumare3) the rate limiter fix that you propose above should potentially fix the second problem that Robert listed. The first problem still needs to be resolved in the plugin: the resource quotas are respected for the first set of executions, but the subsequent set of executions goes into a queued state forever and never gets into a running state.
h
oh... is that what you see? I thought it wasn't being used at all...
looking more into that..
Can you check the logs for "Attempting to finalize resource"? There is also a metric, .resource_release_failed, that tracks failures to release resources. Can you check for that too?
a
sure, we will check and let you know, but do we know of any potential reason for this to happen? Also, I was under the impression that the jobs would be released as and when a task completes and its resources are released, but it looks like it is sending them in batches, basically waiting for all the jobs to complete (within the configured resource quota limit) before releasing the next set of jobs.
k
@anantharaman janakiraman / @GF / @Robert Ambrus I think we found another small bug and fixed it https://github.com/flyteorg/flyte/pull/5195 Please deploy the new patches
g
Hi @Ketan (kumare3), thank you for providing the above PRs. We have done some initial testing of these and it goes as follows:
• https://github.com/flyteorg/flyte/pull/5195 looks like it is working now - the resourceQuota is enforced on job submits
• https://github.com/flyteorg/flyte/pull/5190 is still not enforcing the rates for the GET calls towards the downstream system. Could you please advise on this?
cc @anantharaman janakiraman @Robert Ambrus
k
I am out today, but can you please file a ticket? Hopefully the biggest problem is solved. Reduce the number of parallel jobs for now? Cc @Kevin Su this PR worked, right?
k
looking
r
@Ketan (kumare3) @Kevin Su ticket raised: https://github.com/flyteorg/flyte/issues/5202 cc @anantharaman janakiraman @Aarthi Vellingiri @GF