# ask-the-community
b
We’ve seen something similar lately (only happening in a `map_task` though) - so following along here for any debugging tips
f
I don’t have a map task but I create tasks in a for loop in the workflow.
n
Hi Fabio, are those tasks using files or tabular data as input? If not, what’s the type signature?
f
No, not using any tabular data or files that would be expected to brick the cache:
```python
def train(
    task_name: str,
    param: List[int],
    stage: StageConfig,
    base_path: str = BASE_PATH,
    warm_start_path: Optional[str] = None,
) -> str:
```
`StageConfig` is a `dataclass_json` that also has other `dataclass_json`s nested under it.
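For illustration, a hypothetical nested config of this shape (the field names are invented; in the real code the classes also carry the `@dataclass_json` decorator from the `dataclasses-json` package, omitted here so the sketch needs only the stdlib):

```python
from dataclasses import dataclass, field, asdict

# In the real workflow these classes are decorated with @dataclass_json;
# plain dataclasses are used here so the sketch is stdlib-only.
@dataclass
class OptimizerConfig:  # hypothetical nested config
    lr: float = 1e-3
    momentum: float = 0.9

@dataclass
class StageConfig:  # hypothetical fields, for illustration only
    name: str = "stage-1"
    epochs: int = 10
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)

# Flyte builds the cache key from the hashed, serialized inputs, so a nested
# structure like this only works as a cache key if it serializes deterministically.
cfg = StageConfig()
print(asdict(cfg)["optimizer"]["lr"])  # 0.001
```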
n
Perhaps a related tangent, but are you specifying defaults in the task signature? I wasn’t aware that defaults were supported in tasks
f
Yes, I am 🤔 But I have been doing this before as well.
(Didn’t know I wasn’t supposed to ahah)
n
@Eduardo Apolinario (eapolinario) @Yee @Dan Rammer (hamersaw) I think one of the Q1 roadmap items is related to this ^^
k
Yes, defaults are not supported - hmm, this should have failed at compile time. Only a `None` default is allowed - i.e. tasks can have optional inputs now, so that you can rev versions.
`base_path` is weird. `warm_start_path` is ok
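The rule described above - a `None` default on an `Optional` input is fine, any other default is suspect - can be sketched with plain `inspect` (the `validate_defaults` helper is hypothetical, not flytekit code):

```python
import inspect
from typing import Optional

def validate_defaults(fn) -> list:
    """Return names of parameters whose default is neither empty nor None."""
    bad = []
    for name, p in inspect.signature(fn).parameters.items():
        if p.default is not inspect.Parameter.empty and p.default is not None:
            bad.append(name)
    return bad

def train(task_name: str,
          base_path: str = "/data",                # non-None default: not supported
          warm_start_path: Optional[str] = None):  # None default: ok
    ...

print(validate_defaults(train))  # ['base_path']
```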
n
would that throw-off the cache though?
k
Hmm I am not sure how it will be handled, I think it will be python side only - so it might just work fine
I don’t think it should impact cache
Cache seems to be something else
e
@Fabio Grätz, this is in a dynamic task, right?
d
@Fabio Grätz, I can dive through the backend code later today and make sure the optional inputs aren't an issue. Then I will circle back with you for any additional debugging info. This is a correctness issue, and therefore the highest priority.
f
Hey 🙂 I’m currently trying to create a minimal working example from the full 40h 40 task workflow from the ML engineer in our team that reported this.
I already see some stuff in the workflow that could be problematic, e.g. `def train_stage(…) -> (str, str):` instead of `Tuple[str, str]`.
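The difference is visible in the raw annotations - `(str, str)` is legal syntax but produces a plain tuple of types rather than a typing construct, so anything that inspects return types (like flytekit's interface extraction) can trip over it. A stdlib-only sketch:

```python
from typing import Tuple

def bad() -> (str, str):        # legal syntax, but the annotation is a plain tuple
    return "a", "b"

def good() -> Tuple[str, str]:  # what type-driven tooling expects
    return "a", "b"

# The "annotation" of bad() is literally the object (str, str):
print(isinstance(bad.__annotations__["return"], tuple))   # True
# Tuple[str, str] is a typing object, not a tuple instance:
print(isinstance(good.__annotations__["return"], tuple))  # False
```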
I think until I have a minimal working example that reproduces this, I’m not sure what you could do honestly since I also have never seen this issue before and I did use task defaults 🤔
d
@Bernhard Stadlbauer there are some known issues with cache reporting for map task subtasks. I have a PR out into flyteplugins and there is another issue in flyteconsole to fix this. So the UI is not correctly displaying the cache status - however, in the backend everything should be cached correctly. I know this makes this difficult to understand - but it should not affect correctness.
b
@Dan Rammer (hamersaw) Nice, thank you!
d
no problem! if you think there are larger issues than the reporting. like if, for example, the task durations seem large for something that should have been a cache hit, please let us know. i would be happy to dive into this. it would probably be a bit easier to debug once we get the aforementioned subtask cache status reporting things merged.
b
Perfect will do! This was also reported by one of our science teams, will monitor now 👍
f
@Dan Rammer (hamersaw) @Eduardo Apolinario (eapolinario) I have a minimal working example that for me consistently fails to cache the same set of the successful tasks. It appears to not be a UI issue only as, when relaunching the workflow, the respective pods do run again. For me, the task `generate_report` fails to cache all 4 times. The task `full_evaluation` in the end fails as well. In this example, all the tasks that fail are thus tasks that don’t have a node after them. In the full >50h, ~40 task training, some of the `train` tasks (which do have a task after them in the graph) failed to cache as well. This I cannot reproduce. However, the ML engineer who wrote the workflow didn’t run the workflow in one single execution end to end due to intermittent failures. The restarts might thus have had some influence as well 🤔 Since the full execution takes so long, we haven’t run it again. So far, I have only focused on removing logic while making sure that the cache failure persists. I haven’t yet tried to remove defaults in the tasks etc.
d
@Fabio Grätz thanks so much for this! I will block some time this afternoon for a deep-dive into this, because I suspect it's going to need one 😄
so this is itching at me, not sure I can wait until this afternoon. The `generate_report` is the function we want to cache, right? I am pretty sure we check if there are task outputs; if there are none then there is nothing to cache and we skip it.
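The guard described here can be sketched in a few lines (a hypothetical helper illustrating the behaviour - the actual check lives in propeller, which is Go):

```python
def should_write_cache(task_interface: dict) -> bool:
    """Skip the catalog write entirely when a task declares no outputs,
    since there is nothing to memoize."""
    outputs = task_interface.get("outputs", {})
    return len(outputs) > 0

# A task like generate_report that declares no outputs is never cached:
print(should_write_cache({"inputs": {"path": str}, "outputs": {}}))  # False
# A task with at least one declared output is eligible:
print(should_write_cache({"outputs": {"report_uri": str}}))          # True
```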
f
😿 I think I just wasted 2h creating a minimal working example that doesn’t reproduce the issue
I’m adding return values to `generate_report` and `full_evaluation`.
d
... yeah, can confirm that if there are no outputs we disable caching automatically. sorry! please let us know if there's anything we can help with reproducing this. it's very troublesome to hear.
f
No worries! Unfortunately that means that I cannot reproduce the errors in the full 50h training.
In the original workflow, `generate_report` doesn’t have a return value, so that one is explained. `full_evaluation` also has no return value -> 🤦 silly me … There are two other tasks that do have return values that also showed this behaviour. When trying to construct a minimal working example, I didn’t see this at all today though. The screenshot with the task `search` shows what I called `try_params` in the “minimal working example” (renamed it to avoid internal lingo):
```python
@task(
    cache=True,
    cache_version="0.1",
    requests=Resources(gpu="1", cpu="11", mem="44G"),
    disable_deck=False,
)
def try_params(
    ...
) -> str:

    if _is_distributed_worker():
        raise flytekit.core.base_task.IgnoreOutputs()
    else:
        return wandb.run.path
```
The if statement makes no sense since this is not a pytorch task, but it should never be True since it checks whether `RANK` is set and != 0, which is never the case for Python tasks. For this task, the screenshot shows that of two instances running in parallel, one time it worked, one time it didn’t. Then there is the `train` task:
```python
@task(
    cache=True,
    cache_version="0.1",
    task_config=PyTorch(
        num_workers=3,
    ),
    requests=Resources(gpu="1", cpu="11", mem="44G"),
    disable_deck=False,
)
def train(
    ....
) -> (str, str):

    if _is_distributed_worker():
        raise flytekit.core.base_task.IgnoreOutputs()
    else:
        return wandb.run.path, out_dir
```
Screenshot 2023-01-13 at 18.21.00.png,Screenshot 2023-01-13 at 18.23.46.png,Screenshot 2023-01-13 at 18.24.20.png
@Dan Rammer (hamersaw) do you know what happens in a distributed pytorch task if RANK 0 returns the true value but the other workers raise `flytekit.core.base_task.IgnoreOutputs()`? If one worker with RANK != 0 finishes first and returns nothing, could this lead to the cache being deactivated?
d
yeah, i'm taking a look here and not seeing anything. i don't suspect this has anything to do with pytorch vs python tasks because all of the caching happens in propeller. of course, i could be wrong. i'll have to take a deeper look - it might have to wait until this afternoon with all the moving parts i'm trying to juggle right now.
f
Hey @Dan Rammer (hamersaw), @Eduardo Apolinario (eapolinario), after I sadly wasn’t able to create a minimal working example that reproduced the cache issues, I reran the full training (albeit with 20 instead of 40 tasks, which took 10h). The same cache issues happened again unfortunately. I went through the logs of flytepropeller, datacatalog, and the postgres database and found that for all 3 tasks with cache issues, the logs contained errors like the following almost exactly at the termination time of the tasks:
• “Unable to create dataset”
• “Failed to write result to catalog for task”
• “duplicate key value violates unique constraint ‘dataset_pkey’”
• “Dataset already exists”
This document contains all relevant logs I could find. Could you please take a look at them? If there is anything else I can try or search for in the logs, please let me know!
d
Fabio, thanks so much for this. I was going to circle back with you today (we were off yesterday for a US holiday). Ran some tests with the `IgnoreOutputs` stuff - propeller should fail the task if it tries to cache it and the outputs were ignored, so that cannot be it. The next thing was that propeller transparently fails caching. So exactly what you are seeing: the caching mechanism runs into a failure, and propeller will still mark the task as succeeded but just not cached. We should really make this more observable. All of the “Failed to write result to catalog for task” messages indicate this is what is happening. I will dive into this.
f
Yes, I also tested whether it could be `IgnoreOutputs` by intentionally delaying RANK 0 (or every worker but RANK 0) to trigger a potential race condition. But the correct return value from RANK 0 was always retrieved…
Please let me know if there is anything else I can search for in the logs
d
Yeah, so this is an error in the datacatalog. I know exactly where it happens - just need to figure out how to repro.
f
One thing I cannot judge whether it is important: We register workflows and start executions with flyte remote (because we automatically retrieve the version information from git). We make use of passing custom execution names to `remote.execute`. The execution id of the run was `fc3e15e42c6ec4043b46-sbs-9`.
Our execution names are generated similarly to what flytekit does, only that we allow a user-defined prefix + uid:
```python
uuid_len = 20
value = value + "-" + uuid.uuid4().hex[:uuid_len]
```
But I started more than 100 executions with this only last Friday when trying to create a minimal working example, and didn’t observe any issues caused by that.
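A self-contained sketch of how such a name could be generated (the `make_execution_name` helper and its prefix handling are hypothetical, shown only to make the snippet above runnable):

```python
import uuid

UUID_LEN = 20

def make_execution_name(prefix: str) -> str:
    """User-supplied prefix plus a truncated uuid4 hex suffix, similar in
    spirit to the execution names flytekit generates on its own."""
    return prefix + "-" + uuid.uuid4().hex[:UUID_LEN]

name = make_execution_name("sbs")
print(len(name))  # 24: a 3-char prefix, a dash, and 20 hex chars
```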
> Yeah, so this is an error in the datacatalog. I know exactly where it happens - just need to figure out how to repro.
If there is anything I can help with, I’m happy to search for more stuff in the logs or try to execute stuff. Just ping
d
One quick question: in the datacatalog logs there are a few `DatasetKey` instances printed off that you have blurred out (reasonably so). I don’t necessarily care about the values, but are the fields `Project`, `Domain`, `Name`, and `Version`? There is no `UUID`, correct?
f
Give me one sec, I’ll check again
d
Also, do you know which version of flytepropeller you are running?
f
```
{"json":{…}, "level":"error", "msg":"Failed to create dataset model: &{BaseModel:{CreatedAt:0001-01-01 00:00:00 +0000 UTC UpdatedAt:0001-01-01 00:00:00 +0000 UTC DeletedAt:<nil>} DatasetKey:{Project:object_detection Name:flyte_task-<package name>.applications.<application_name>.<dataset_name>.train_workflow.train_stage Domain:development Version:0.1-NGVJxIhX-egfCQQnT UUID:} SerializedMetadata:[10 67 10 12 116 97 115 107 45 118 101 114 115 105 111 110 18 51 108 117 107 97 115 45 102 101 97 116 45 107 105 116 116 105 95 99 98 100 49 98 102 56 48 95 50 48 50 51 45 48 49 45 49 54 95 48 54 45 53 52 45 50 51 95 100 105 114 116 121] PartitionKeys:[]} err: unexpected error type for: write tcp 10.52.1.3:58112->10.22.0.6:5432: write: broken pipe", "ts":"2023-01-17T01:28:18Z"}
```
Where I redacted internal stuff I was asked to remove, I inserted placeholders like `<dataset name>`. I explicitly did not change the part where it says `UUID:}`.
d
nvm about the propeller version. this is a datacatalog thing.
f
image: `cr.flyte.org/flyteorg/datacatalog-release:v1.2.1`
Flytepropeller is actually one I built myself, I just realized - the final state of the PR in which I added task templates to pytorch jobs!
d
OK - so in the error you sent above, this "broken pipe" is very suspect. In the doc you mentioned that this comes from dataset:36 - is that here?
I'm wondering if there is an issue with the google cloud SQL server and GORM. We use the return code of the `h.db.Create` call to check whether the item already exists or not. It seems the error that is returned from that call is not an "AlreadyExists" error, so we identify it as something more serious.
Can we try something very simple - can we cache two different items for the same task version? Something like:
```python
@task(cache=True, cache_version="1.0")
def hello_world(name: str) -> str:
    return f"hello {name}"

@workflow
def hello_world_wf(name: str) -> str:
    return hello_world(name=name)
```
and try calling it with different values, i.e. `foo`, `bar`, and see what the behavior is. I fear that the call with `bar` will have the same result as here; namely, GORM isn't detecting that the dataset "AlreadyExists", the cache put fails, and `bar` is not cached.
Maybe keep an eye on the logs if possible, just to validate we're seeing the same error messages in datacatalog and flytepropeller.
f
> OK - so in the error you sent above here this “broken pipe” is very suspect. In the doc you mentioned that this comes from dataset:36, is that here?
About the “broken pipe”, which I agree is suspect: I just checked the CloudSQL machine; it is of a rather weak type since so far it hasn’t received much load. Do you think that a slow or failed response could be misinterpreted here? Whether dataset:36 is where you linked I can’t say for sure - the repo doesn’t have the 1.2.1 tag that the image I’m running uses.
Give me a second, I’ll run your workflow and observe both the datacatalog and the propeller logs.
d
make sure you're running the same version both times. i think fast register will register a new version at each run.
f
First execution: Data catalog:
• “/go/src/github.com/flyteorg/datacatalog/pkg/repositories/gormimpl/dataset.go:51 write failed: write tcp 10.52.1.3:43244->10.22.0.6:5432: write: broken pipe”
• “[1.087ms] [rows:0] SELECT * FROM "datasets" WHERE "datasets"."project" = 'sandbox' AND "datasets"."name" = 'flyte_task-workflow.hello_world' AND "datasets"."domain" = 'development' AND "datasets"."version" = '1.0-MjvydOS6-zCZxwZgs' ORDER BY "datasets"."created_at" LIMIT 1”
• “Unable to get dataset request dataset<project:"sandbox" name:"flyte_task-workflow.hello_world" domain:"development" version:"1.0-MjvydOS6-zCZxwZgs" > err: unexpected error type for: write tcp 10.52.1.3:43244->10.22.0.6:5432: write: broken pipe”
• “Dataset does not exist key: {Project:sandbox Name:flyte_task-workflow.hello_world Domain:development Version:1.0-MjvydOS6-zCZxwZgs UUID:}, err missing entity of type Dataset with identifier project:"sandbox" name:"flyte_task-torch_experiments.workflow.hello_world" domain:"development" version:"1.0-MjvydOS6-zCZxwZgs"”
I’ll click on relaunch and provide a different value to the workflow
Propeller first execution:
• “Catalog Failure: memoization check failed. err: DataCatalog failed to get dataset for ID resource_type:TASK project:"sandbox" domain:"development" name:"workflow.hello_world" version:"hwuiQzzwpRF386ypXM2SYQ==" : rpc error: code = Internal desc = unexpected error type for: write tcp 10.52.1.3:43244->10.22.0.6:5432: write: broken pipe”
• “failed to check catalog cache with error”
• “handling parent node failed with error: Failed to check Catalog for previous results: DataCatalog failed to get dataset for ID resource_type:TASK project:"sandbox" domain:"development" name:"workflow.hello_world" version:"hwuiQzzwpRF386ypXM2SYQ==" : rpc error: code = Internal desc = unexpected error type for: write tcp 10.52.1.3:43244->10.22.0.6:5432: write: broken pipe”
• “Error in handling running workflow [Failed to check Catalog for previous results: DataCatalog failed to get dataset for ID resource_type:TASK project:"sandbox" domain:"development" name:"workflow.hello_world" version:"hwuiQzzwpRF386ypXM2SYQ==" : rpc error: code = Internal desc = unexpected error type for: write tcp 10.52.1.3:43244->10.22.0.6:5432: write: broken pipe]”
2nd execution (same version) different input passed:
Data catalog 2nd execution:
• “Dataset already exists key: id:<project:"sandbox" name:"flyte_task-workflow.hello_world" domain:"development" version:"1.0-MjvydOS6-zCZxwZgs" > metadata:<key_map:<key:"task-version" value:"hwuiQzzwpRF386ypXM2SYQ==" > >, err value with matching already exists (duplicate key value violates unique constraint "datasets_pkey")”
• “2023/01/17 15:00:36 /go/src/github.com/flyteorg/datacatalog/pkg/repositories/gormimpl/dataset.go:36 ERROR: duplicate key value violates unique constraint "datasets_pkey" (SQLSTATE 23505)”
d
So they both successfully wrote to cache then? ... interesting
f
In the UI it seems so
Do you see the same results in your datacatalog and flytepropeller?
d
yeah, this is working as expected. i'm not sure how we can repro the GORM write failure - ie. "Failed to create dataset model ... write tcp ..." - because this is 100% the issue.
it seems pretty sparse, maybe it's related to resources as you mentioned ... i'm really not sure. i'm going to do some searches on this and see what i can dig up.
So to quickly summarize this, what we're seeing is intermittent issues with the SQL connection. On certain writes we get an error message from here:
```
{"json":{…}, "level":"error", "msg":"Failed to create dataset model: &{BaseModel:{CreatedAt:0001-01-01 00:00:00 +0000 UTC UpdatedAt:0001-01-01 00:00:00 +0000 UTC DeletedAt:<nil>} DatasetKey:{Project:object_detection Name:flyte_task-<package name>.applications.<application_name>.<dataset_name>.train_workflow.train_stage Domain:development Version:0.1-NGVJxIhX-egfCQQnT UUID:} SerializedMetadata:[10 67 10 12 116 97 115 107 45 118 101 114 115 105 111 110 18 51 108 117 107 97 115 45 102 101 97 116 45 107 105 116 116 105 95 99 98 100 49 98 102 56 48 95 50 48 50 51 45 48 49 45 49 54 95 48 54 45 53 52 45 50 51 95 100 105 114 116 121] PartitionKeys:[]} err: unexpected error type for: write tcp 10.52.1.3:58112->10.22.0.6:5432: write: broken pipe", "ts":"2023-01-17T01:28:18Z"}
```
other times we are receiving the error from here:
```
Dataset already exists key: id:<project:"sandbox" name:"flyte_task-workflow.hello_world" domain:"development" version:"1.0-MjvydOS6-zCZxwZgs" > metadata:<key_map:<key:"task-version" value:"hwuiQzzwpRF386ypXM2SYQ==" > > , err value with matching already exists (duplicate key value violates unique constraint "datasets_pkey")
```
These occur when propeller is initially trying to create the dataset to mark it as cached. Rather than looking up the dataset to see if it exists, it attempts to create a new one and detects the "AlreadyExists" error. In the former case, the error does not show that the dataset already exists, and propeller fails the cache put.
This is very complicated to debug. I can't imagine that Google is tearing down the DB and monitoring connections to hot-start it when there are requests. So perhaps it is a GORM connection lifetime issue. Thoughts?
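The create-then-classify pattern described here can be illustrated with sqlite's unique-constraint error standing in for Postgres (a sketch of the idea only, not the actual GORM/datacatalog code): the second insert is classified as a benign AlreadyExists, but any *other* error - like a dropped connection - falls through as a hard failure and the cache put is abandoned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datasets (project TEXT, name TEXT, version TEXT, "
             "PRIMARY KEY (project, name, version))")

def create_dataset(project: str, name: str, version: str) -> str:
    """Insert first, then classify the failure - mirroring how datacatalog
    detects an existing dataset via the unique-constraint violation."""
    try:
        conn.execute("INSERT INTO datasets VALUES (?, ?, ?)",
                     (project, name, version))
        return "CREATED"
    except sqlite3.IntegrityError:
        return "ALREADY_EXISTS"   # benign: the dataset was cached earlier
    except sqlite3.OperationalError:
        return "PUT_FAILURE"      # e.g. broken pipe: the cache write is lost

print(create_dataset("sandbox", "hello_world", "1.0"))  # CREATED
print(create_dataset("sandbox", "hello_world", "1.0"))  # ALREADY_EXISTS
```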
f
Thanks for the summary @Dan Rammer (hamersaw)! I searched the datacatalog logs for all occurrences of `Failed to create dataset`. In the 9 days the current pod has been up, there have been 12 occurrences. All 12 occurrences happened in the “actual” workflow built and executed by the ML engineer (who ran this less than a handful of times), where the `train_stage` and `param_search` tasks that show this behaviour take at least 1h, but depending on the config up to 3-5h. I copied the engineer’s workflow to try to create a minimal working example. The signature of the tasks and the structure of the workflow remained the same, but the tasks took only a few seconds to complete. This workflow I ran probably dozens of times. Since the error only happened in the original long-running workflow, which was executed far less often, I wonder whether there might be a connection somewhere that is kept open for a long time and might then fail due to gcp network errors 🤔 Basically: I wasn’t able to reproduce the result because my “minimal working example” didn’t take long enough.
d
> I wasn’t able to reproduce the result because my “minimal working example” didn’t take long enough.
exactly what I was thinking
f
Do you know whether there is a db connection in datacatalog that is kept open for a longer time when the task takes longer?
To test this I started an execution of my “minimal not yet working example” where I inserted a 1.5h `time.sleep` instead of doing actual training, to not waste GPU time.
Let’s see.
d
So in datacatalog we rely on GORM to handle the DB connection pool. There do seem to be a few configuration parameters we could play with for the DB connection, e.g. `SetConnMaxLifetime` and `SetConnMaxIdleTime`, but I'm not sure these will help. Let's see what your test comes back with - if we can reproduce it, we can add configuration for the aforementioned options and run some more tests. Does that sound reasonable?
e
Also, just to confirm: the datacatalog and postgres pods didn't go away / weren't restarted while this was happening, right?
f
Nope, the datacatalog pod has been alive for the past 9 days (during which all of this happened), and the postgres database, which is not a pod but one managed by google, doesn’t show any sign of having been restarted.
> Lets see what your test comes back with - if we can reproduce it we can add configuration for the aforementioned options and run some more tests. Does that sound reasonable?
Would propose exactly the same. If it doesn’t reproduce the error, I’ll give the postgres instance more resources to see if this resolves it 🤷
Thanks for looking into this today with me @Dan Rammer (hamersaw)!
d
No problem! Let's hope one of these fixes it - I think we're getting close 🙏 😅
f
Hey Dan, quick update: simply by adding `time.sleep(7200)` I was able to make one of the tasks fail the cache put. Then I re-ran the workflow; this task ran again, and again failed the cache put. I will now use a better machine for the cloud sql instance and check whether this goes away.
Do you have an idea why a longer running task can trigger this? I would assume the cache put operation itself is independent of how long the task runs 🤔
k
Cache put operation is independent. That does not make sense
f
Maybe it is just random then 🤷 I’m as of now provisioning a stronger database machine. I will report whether this makes it go away.
k
Hmmm, this is very critical. Did you see errors in the logs?
Dan already looked through those logs and explained that the relevant logs come from here: https://github.com/flyteorg/datacatalog/blob/faa86dbf56cce108f2c0b91f8fa2a99f67c1586f/pkg/manager/impl/dataset_manager.go#L86
So a connection problem to the database might be the cause.
And so far I’ve been using a cheap db instance.
d
Ok, so at least we can reproduce it. We use GORM to handle the DB connection pool; my guess is that there is an issue where long-idle connections are terminated. I think we have a few possible solutions here:
1. use a larger DB instance - maybe google is shutting it down periodically?
2. use the `SetConnMaxLifetime`, `SetConnMaxIdleTime`, etc. parameters on the DB connection to allow long-idle connections from the client side
3. hack some kind of periodic “ping” service in datacatalog to ensure connections are not idle for long periods
Do you have any other ideas?
f
No, this sounds good to me. I’m re-running a 2h `time.sleep` workflow with the better db instance now, will report back.
Hey @Dan Rammer (hamersaw) and @Eduardo Apolinario (eapolinario), Update: After upgrading the GCP CloudSQL database to a “not-super-cheap” machine, one that is covered by their SLAs, I unfortunately still see the cache errors (see screenshots).
Next, I took down the Flyte helm release and re-installed the 1.3.0 release. I re-ran this workflow and still see the cache issues.
The workflow contains a double loop to construct the DAG. When the ML engineer showed this to me, I was actually surprised this works without a map task.
Screenshot 2023-01-20 at 15.42.19.png
Since so far I only saw the cache issues in this kind of workflow, I decided to run a workflow with a trivial structure to see if I can reproduce the errors.
Screenshot 2023-01-20 at 15.44.44.png
Screenshot 2023-01-20 at 15.45.06.png
k
Why not run in sandbox?
f
You mean instead of replacing the cloudsql database with one running in the cluster? Good point, this is quicker.
k
i mean to test it locally and see if it happens again
f
Will do and report back 👍 Have a nice weekend 🙂
d
@Fabio Grätz I can kick off a few tests here on our side - will plan to discuss on Monday.
@Fabio Grätz any news on this? Unfortunately, (or fortunately 😅) I ran the workflow and did not have any issues.
f
Yes, let’s definitely call it fortunately 😅 I replaced the managed database with one running in a stateful set. (This cost me ~5 min since I just copied some existing manifests; that’s why I didn’t go for the local sandbox.) Currently it is running without any failures, but it has also only completed the “first stage” of the param grid searches in the workflow. Will let you know once it finishes!
Thanks for running the workflow as well 🙏
d
Perfect! let's hope this was the issue.
f
Replacing the managed GCP CloudSQL database with one running in the cluster did not solve the problem; I was still seeing timeouts. However, I am pretty sure I finally figured out what the problem is. The 40 task, 20h workflow I linked above, which consistently had 2-3 cache errors, has almost completed without any problems (final task still running).
Now that I know, I’m a bit embarrassed that I didn’t realize and test this right away 🙈: Our cluster uses istio as a service mesh. This means that each of the pods in the flyte namespace had an envoy proxy sidecar that redirects all of its inbound and outbound traffic. Turns out that the envoy proxies have a default timeout for idle connections of 1h. I assume that datacatalog has long-living connections managed by gorm, which were then killed by envoy when idling during long-running executions. This would be a logical explanation for why I was never able to reproduce this unless the tasks take a long time.
I will run more workflow executions over the next days just to be sure that this successful execution wasn’t just random; however, this explanation sounds logical to me, and I find it a more satisfying conclusion than “we couldn’t use a gcp managed database since somehow the connection was spotty”. I’m sorry if I caused worries here. Thanks a ton for trying to figure this out with me.
k
but this should be handled by retries in propeller and connection pooling in gorm?
d
propeller side we gracefully fail (ie. "PUT_FAILURE") if there is a failure during a cache write, so there is no retry.
f
If the cache errors are really solved by removing the envoy proxies which likely kill the connections, I don’t think the retry mechanism works.
k
why not, otherwise all caching will stop working?
@Dan Rammer (hamersaw) we definitely should have grpc retries?
d
the grpc connection from flytepropeller <-> datacatalog isn't the one being dropped, it's the database connection from datacatalog. So propeller attempts to write the cache, datacatalog returns a failure that says "db connection dropped" (or whatever it was in the logs), and propeller marks the cache status as `PUT_FAILURE`. Would have to look through the code to understand where / if a retry makes sense here.
@Fabio Grätz so the saga continues 😆. I think we continually get closer though; this sounds very promising. Let us know what you find - maybe we should add some retry in propeller ... that would certainly fix this.
f
If this turns out to be correct, then I shouldn’t see any cache errors any more since I turned off istio for the flyte namespace. Could certainly turn it back on in a sandbox to reproduce this. Maybe there even is a way to set the timeout to a shorter time to allow reproducing this quicker 🤔 Since this all hints at the connection from datacatalog to the db being dropped, shouldn’t the retry mechanism be implemented in datacatalog in case a “db connection dropped” error is encountered?
d
I would expect GORM and the connection pool would handle dropped connections
f
But it does look like this is not the case, right?
k
maybe you do not have connection pooling enabled?
f
You mean in the datacatalog code or is this user facing?
k
config - lets check - cc @katrina / @Prafulla Mahindrakar do you know this top of mind?
f
Maybe one can use an istio destination rule to artificially configure a very short tcp connection timeout. I can try this
d
re ^^: it's probably worth it to look through admin code, i think there are a few updates to DB interaction that never made it to datacatalog.
k
looks like we have it in admin: https://github.com/flyteorg/flyteadmin/pull/358/files and the new db config in stdlib exposes it as a config option too, but i don't see the code in datacatalog reading it
d
i'll file an issue to add support for that configuration in datacatalog (@Fabio Grätz it just looks like the `SetConnMaxLifetime`, etc. configuration options we had proposed adding earlier). Regardless of whether we need it for this solution or not, we should have it enabled in datacatalog.