full-toddler-5766
02/24/2024, 11:41 PM
Question about webapi.ResourceQuotas (https://docs.flyte.org/en/latest/deployment/configuration/generated/flytepropeller_config.html#resourcequotas-webapi-resourcequotas): what happens when this limit is hit? Is it an enforced limit when configured in FlytePropeller for a Flyte plugin?
average-finland-92144
02/27/2024, 4:59 PM
hallowed-mouse-14616
02/27/2024, 6:20 PM
glamorous-carpet-83516
02/27/2024, 6:55 PM
full-toddler-5766
02/29/2024, 10:51 PM
full-toddler-5766
02/29/2024, 10:53 PM
glamorous-carpet-83516
03/25/2024, 5:26 PM
glamorous-carpet-83516
03/25/2024, 5:26 PM
high-park-82026
kubectl port-forward -n flyte deploy/flytepropeller 10254
Then go to the browser and visit: http://localhost:10254/config
You should find "resourceQuotas" with the values you specified. Mind sending a screenshot of that?
aloof-painting-18735
03/26/2024, 6:00 AM
aloof-painting-18735
03/26/2024, 12:44 PM
databricks:
  enabled: true
  upload_entrypoint: true
  plugin_config:
    plugins:
      databricks:
        databricksInstance: <!--- set to our DBX instance --->
        # this is the entrypoint.py for flyte on databricks
        entrypointFile: <!--- set to our DBX entrypoint file location --->
        # this file is mounted by vault agent injector at /vault/secrets
        databricksTokenKey: <!--- set to our DBX token key --->
        webApi:
          caching:
            maxSystemFailures: 5
            resyncInterval: 60s #default value is 30s!
            size: 500000
            workers: 10
          readRateLimiter:
            burst: 20 #default value is 100!
            qps: 10
          resourceMeta: null
          resourceQuotas:
            default: 10 #default value is 1000!
          writeRateLimiter:
            burst: 20 #default value is 100!
            qps: 10
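For reference, qps and burst settings such as readRateLimiter and writeRateLimiter above are conventionally token-bucket parameters: the bucket holds at most burst tokens and refills at qps tokens per second. The sketch below is illustrative Python under that assumption, not the Databricks plugin's code; TokenBucket and count_allowed are hypothetical names.

```python
class TokenBucket:
    """Token bucket: refills at `qps` tokens per second, holds at most `burst` tokens."""

    def __init__(self, qps: float, burst: int, now: float = 0.0):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # assumption: the bucket starts full
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.qps)
        self.last = now
        # Spend one token per request if one is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(qps: float, burst: int, request_times: list) -> int:
    """Count how many requests (given as timestamps in seconds) pass the limiter."""
    start = request_times[0] if request_times else 0.0
    bucket = TokenBucket(qps, burst, now=start)
    return sum(1 for t in request_times if bucket.allow(t))
```

With qps: 10 and burst: 20, a burst of 50 calls at t=0 followed by 50 more one second later would let through 20 and then 10, provided the limiter is actually wired into the request path.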
aloof-painting-18735
03/26/2024, 12:50 PM
• readRateLimiter settings are not applied
• Flyte is ignoring the resourceQuotas / default value - we set it to 10 and 25 Spark tasks are running simultaneously (that's the limit set by flyteadmin / maxParallelism)
aloof-painting-18735
03/26/2024, 12:56 PM
aloof-painting-18735
03/26/2024, 1:04 PM
Regarding the webApi config for the Flyte Databricks plugin: we came across the Flyte ResourceManager page, and we currently have this Propeller config:
propeller:
  resourcemanager:
    type: noop
@glamorous-carpet-83516 Can you please clarify whether we need to set up a ResourceManager for the settings in the webApi config to be applied?
aloof-painting-18735
03/26/2024, 1:24 PM
aloof-painting-18735
03/26/2024, 1:24 PM
gentle-state-35322
03/26/2024, 8:13 PM
gentle-state-35322
03/26/2024, 8:15 PM
freezing-airport-6809
gentle-state-35322
03/26/2024, 11:42 PM
gentle-state-35322
03/26/2024, 11:43 PM
freezing-airport-6809
gentle-state-35322
03/26/2024, 11:52 PM
high-park-82026
propeller:
  resourcemanager:
    type: redis
    redis:
      hostPaths:
        - <redis replica 1>...
      hostKey: <password>
      maxRetries: 3
aloof-painting-18735
03/27/2024, 4:44 PM
webApi conf?
aloof-painting-18735
03/27/2024, 4:56 PM
high-park-82026
aloof-painting-18735
03/27/2024, 5:03 PM
gentle-state-35322
03/27/2024, 5:05 PM
gentle-state-35322
03/27/2024, 5:05 PM
high-park-82026
freezing-airport-6809
gentle-state-35322
03/29/2024, 4:48 AM
aloof-painting-18735
04/02/2024, 3:54 PM
aloof-painting-18735
04/02/2024, 3:55 PM
OBSERVATIONS
1. databricks / resourceQuotas is applied successfully
   ◦ 10 tasks in RUNNING state - launched (resourceQuotas)
   ◦ 15 tasks in RUNNING state - queued (max_parallelism - resourceQuotas)
   ◦ all the remaining tasks in UNKNOWN state
   ◦ 10 launched tasks succeeded
   ◦ 10 more tasks moved to RUNNING state - queued
   ◦ ❗ unfortunately, the workflow is stuck in this phase: once a task enters the queued phase, it never seems to move to the launched phase
2. ❗ databricks / webApi / readRateLimiter is not applied
   ◦ we still see in the logs that more than a hundred requests per second are sent to the downstream API, even though we set QPS = 10 and BURST = 20
So it seems that the Redis - Flyte integration itself works, but we still face functional issues.
Although both issues are important, the second one is critical. Can we focus on that one?
QUESTIONS
• Is the webApi / readRateLimiter config supposed to be applied by the Redis ResourceManager?
• Do we need any other configurations besides the ones we already shared?
aloof-painting-18735
04/02/2024, 3:59 PM
resourcemanager:
  resourceMaxQuota: 1000
  redis:
    hostKey: *****
    hostPaths:
      - *****
    maxRetries: 3
  type: redis
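The behaviour expected from resourceQuotas in the observations above can be pictured as a counting quota: at most N tasks hold a token, and the rest wait and retry once tokens are released. This is a minimal Python sketch under that assumption; QuotaPool is a hypothetical name and this is not the Redis resource manager's implementation.

```python
from collections import deque


class QuotaPool:
    """Counting quota: at most `quota` task ids may hold a token at once."""

    def __init__(self, quota: int):
        self.quota = quota
        self.held = set()       # tasks that currently hold a token ("launched")
        self.waiting = deque()  # tasks that asked but were refused ("queued")

    def request(self, task_id) -> bool:
        """Try to allocate a token; refused tasks are remembered as waiting."""
        if task_id in self.held:
            return True
        if len(self.held) < self.quota:
            self.held.add(task_id)
            if task_id in self.waiting:
                self.waiting.remove(task_id)
            return True
        if task_id not in self.waiting:
            self.waiting.append(task_id)
        return False

    def release(self, task_id) -> None:
        """Return a token so that a waiting task can obtain it on its next retry."""
        self.held.discard(task_id)


pool = QuotaPool(quota=10)
launched = [t for t in range(25) if pool.request(t)]       # first 10 get a token
for t in launched:                                         # launched tasks finish
    pool.release(t)
promoted = [t for t in list(pool.waiting) if pool.request(t)]  # queued tasks retry
```

In this model the 15 queued tasks are promoted as soon as tokens are released, which is the step that appears to be stuck in the report above.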
full-toddler-5766
04/02/2024, 9:13 PM
aloof-painting-18735
04/03/2024, 1:12 PM
freezing-airport-6809
aloof-painting-18735
04/03/2024, 2:59 PM
freezing-airport-6809
freezing-airport-6809
aloof-painting-18735
04/03/2024, 3:02 PM
freezing-airport-6809
aloof-painting-18735
04/03/2024, 3:12 PM
freezing-airport-6809
aloof-painting-18735
04/03/2024, 3:36 PM
webApi / readRateLimiter config? That's the most burning issue for us. It seems these configs are ignored for the Databricks plugin. Are they supposed to be applied by the Flyte ResourceManager?
aloof-painting-18735
04/03/2024, 4:08 PM
webApi / readRateLimiter config. If you could clarify which component (e.g. Flyte plugin, Redis) is responsible for applying this config, that would be very helpful.
freezing-airport-6809
freezing-airport-6809
full-toddler-5766
04/03/2024, 4:35 PM
gentle-state-35322
04/03/2024, 7:31 PM
gentle-state-35322
04/03/2024, 7:32 PM
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
gentle-state-35322
04/04/2024, 2:54 AM
gentle-state-35322
04/04/2024, 2:55 AM
freezing-airport-6809
freezing-airport-6809
gentle-state-35322
04/04/2024, 4:27 AM
gentle-state-35322
04/04/2024, 4:27 AM
gentle-state-35322
04/04/2024, 4:51 AM
aloof-painting-18735
04/04/2024, 12:48 PM
aloof-painting-18735
04/04/2024, 12:50 PM
gentle-state-35322
04/04/2024, 1:00 PM
freezing-airport-6809
high-park-82026
high-park-82026
webApi:
  caching:
    maxSystemFailures: 5
    resyncInterval: 60s #default value is 30s!
    size: 500000
    workers: 10
  readRateLimiter:
    burst: 20 #default value is 100!
    qps: 10
  resourceMeta: null
  resourceQuotas:
    default: 1 #default value is 1000!
  writeRateLimiter:
    burst: 20 #default value is 100!
    qps: 10
This is my relevant config... I set default to 1 just to make sure I run out of quota quickly...
Trying to see if there is an issue with freeing up tokens that might cause this...
In the meantime, do you mind enabling INFO logs on propeller and looking for the following lines:
Start building a resource manager
to the Redis Qubole set
Too many allocations
@gentle-state-35322 @aloof-painting-18735
gentle-state-35322
04/04/2024, 5:39 PM
gentle-state-35322
04/04/2024, 5:41 PM
gentle-state-35322
04/04/2024, 5:42 PM
high-park-82026
resourceQuotas:
  default: 1 #default value is 1000!
I didn't set any resource constraints
gentle-state-35322
04/04/2024, 6:38 PM
gentle-state-35322
04/04/2024, 6:39 PM
gentle-state-35322
04/04/2024, 6:39 PM
gentle-state-35322
04/04/2024, 7:08 PM
high-park-82026
gentle-state-35322
04/04/2024, 7:39 PM
full-toddler-5766
04/04/2024, 10:58 PM
full-toddler-5766
04/04/2024, 11:16 PM
{"json":{"routine":"databricks-worker-1","src":"plugin.go:164"},"level":"debug","msg":"Get databricks job response%!(EXTRA string=resp, *http.Response=\u0026{429 Too Many Requests 429 HTTP/2.0 2 0 map[Date:[Wed, 03 Apr 2024 07:37:26 GMT] Retry-After:[1] Server:[databricks] ... [Maximum rate of 100 requests per SECOND has been exceeded. Please reduce the rate of requests and try again after 1 second(s)]] {} 0 [] false false map[] ...})","ts":"2024-04-03T07:37:26Z"}
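One way to check the request rate from logs like the line above is to count matching entries per second. This is a rough Python sketch, assuming each propeller log line is a standalone JSON object with msg and ts fields as in this sample, and that ts has one-second resolution; requests_per_second is a hypothetical helper, not part of Flyte.

```python
import json
from collections import Counter


def requests_per_second(log_lines, needle="Get databricks job response"):
    """Count log entries whose msg contains `needle`, grouped by their ts value."""
    counts = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (e.g. plain-text output)
        if not isinstance(entry, dict):
            continue
        if needle in str(entry.get("msg", "")):
            counts[str(entry.get("ts", ""))] += 1
    return counts
```

If the read limiter were in effect with qps: 10 and burst: 20, the per-second counts would be expected to settle near 10 after any initial burst.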
gentle-state-35322
04/04/2024, 11:49 PM
gentle-state-35322
04/04/2024, 11:51 PM
aloof-painting-18735
04/05/2024, 1:35 PM
autoRefreshCache, err := cache.NewAutoRefreshCache(name, q.SyncResource,
    workqueue.DefaultControllerRateLimiter(), cfg.ResyncInterval.Duration, cfg.Workers, cfg.Size,
    scope.NewSubScope("cache"))
• The ResyncInterval, Workers and Size configs are respected (this works in our setup too), but I can't see the webapi ratelimiter configs being used anywhere; workqueue.DefaultControllerRateLimiter() uses hard-coded values (qps: 10, burst: 100)
Can you please confirm that my understanding is correct and that the ratelimiter configs should be applied here?
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
resource quota working?
gentle-state-35322
04/05/2024, 4:38 PM
gentle-state-35322
04/05/2024, 4:38 PM
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
gentle-state-35322
04/05/2024, 5:06 PM
high-park-82026
high-park-82026
high-park-82026
Attempting to finalize resource
There is also a metric .resource_release_failed that tracks failures to release resources. Can you check for that too?
gentle-state-35322
04/05/2024, 6:13 PM
freezing-airport-6809
full-toddler-5766
04/08/2024, 4:45 PM
freezing-airport-6809
glamorous-carpet-83516
04/08/2024, 5:18 PM
aloof-painting-18735
04/09/2024, 11:46 AM