# contribute
g
Hey folks - has anyone noticed any perf degradations going from 1.15.0 -> 1.15.1? We have some nightly test runs that are now reproducibly failing every night in a specific test environment, and they're all tied to use of our agent (which has not changed). The only delta here seems to be that flyteadmin, propeller, scheduler and datacatalog have changed from 1.15.0 -> 1.15.1 (copilot is on 1.15.1 and doesn't seem to contribute to the failures). We're still digging in to figure out what changed -- but my current hunches are something to do with resource utilization, flytepropeller / agent interaction, or some scheduling differences. Sound familiar to anyone? (I took a look through the git history and nothing stood out.) This is definitely a weird one! cc @gentle-umbrella-41187
d
if it's an agent task
when upgrading
maybe it's related to this PR
did you change the config of the agent?
right now we only wait 3 seconds for the agent server's response when querying supported task types
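For anyone following along, here is a rough sketch of what a bounded wait like that looks like. This is not propeller's actual code (propeller is written in Go); the function names and the `fetch` callable are hypothetical, and the 3-second value is just the number mentioned above.

```python
import asyncio

# Assumption: mirrors the 3-second wait mentioned in the thread.
AGENT_QUERY_TIMEOUT_SECONDS = 3.0


async def query_supported_task_types(fetch, timeout=AGENT_QUERY_TIMEOUT_SECONDS):
    """Return the agent's supported task types, or None if it does not answer in time.

    `fetch` is a hypothetical coroutine function standing in for the real agent call.
    """
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout)
    except asyncio.TimeoutError:
        # The agent was too slow; the caller has to decide how to proceed.
        return None


async def fast_agent():
    # Hypothetical agent that answers immediately.
    return ["some_task_type"]


async def slow_agent():
    # Hypothetical agent that answers only after a delay.
    await asyncio.sleep(0.2)
    return ["some_task_type"]
```

With a setup like this, an agent server that is slow to respond (for example under heavy load) looks the same to the caller as one that supports nothing, which is why a timeout is worth ruling out when agent tasks start failing after an upgrade.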
g
Carrying over my comment from the other thread in #CP2HDHKE1 where I cross-posted: This turned out to be a combination of a small change in Flyte's handling of token scopes in https://github.com/flyteorg/flyte/pull/6336 for 1.15.1, coupled with a specific difference in test environment concurrency + some tests that are arguably doing bad things with tokens 🙂 @gentle-umbrella-41187 was able to debug and track things down and can add more color here, but I don't think any changes will be needed on the Flyte side
The weird part was that the behavior really only showed up when run in a very specific hardware environment - so it was partially a test ordering issue
Thanks for the heads up on that new propeller setting @damp-lion-88352!