cool-lifeguard-49380
01/12/2023, 2:26 PM
agreeable-kitchen-44189
01/12/2023, 2:28 PM
map_task
though) - so following along here for any debugging tips
cool-lifeguard-49380
01/12/2023, 2:28 PM
broad-monitor-993
01/12/2023, 2:47 PM
cool-lifeguard-49380
01/12/2023, 2:50 PM
def train(
task_name: str,
param: List[int],
stage: StageConfig,
base_path: str = BASE_PATH,
warm_start_path: Optional[str] = None,
) -> str:
StageConfig is a dataclass_json that also has other `dataclass_json`s nested under it.
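For illustration, such a nested config could look roughly like the sketch below (field names are invented for this example; only StageConfig itself is from the message above):
from dataclasses import dataclass, field

from dataclasses_json import dataclass_json


@dataclass_json
@dataclass
class OptimizerConfig:
    # hypothetical nested config, not from the actual workflow
    lr: float = 1e-3
    momentum: float = 0.9


@dataclass_json
@dataclass
class StageConfig:
    # hypothetical fields; the real StageConfig nests further dataclass_jsons
    epochs: int = 10
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)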
broad-monitor-993
01/12/2023, 2:56 PM
cool-lifeguard-49380
01/12/2023, 2:57 PM
cool-lifeguard-49380
01/12/2023, 2:57 PM
broad-monitor-993
01/12/2023, 3:36 PM
freezing-airport-6809
freezing-airport-6809
broad-monitor-993
01/12/2023, 3:41 PM
freezing-airport-6809
freezing-airport-6809
freezing-airport-6809
high-accountant-32689
01/12/2023, 8:56 PM
hallowed-mouse-14616
01/13/2023, 1:40 PM
cool-lifeguard-49380
01/13/2023, 1:41 PM
cool-lifeguard-49380
01/13/2023, 1:42 PM
def train_stage(…) -> (str, str): instead of Tuple[str, str].
cool-lifeguard-49380
01/13/2023, 1:42 PM
hallowed-mouse-14616
01/13/2023, 1:44 PM
agreeable-kitchen-44189
01/13/2023, 2:06 PM
hallowed-mouse-14616
01/13/2023, 2:17 PM
agreeable-kitchen-44189
01/13/2023, 2:18 PM
cool-lifeguard-49380
01/13/2023, 4:45 PM
generate_report fails to cache all 4 times. The task full_evaluation in the end fails as well. Thus, in this example, all the tasks that fail are tasks that don't have a node after them.
In the full training (>50 h, ~40 tasks), some of the train tasks (which do have a task after them in the graph) failed to cache as well. This I cannot reproduce. However, the ML engineer who wrote the workflow didn't run the workflow in one single execution end to end due to intermittent failures. The restarts might thus have had some influence as well 🤔 Since the full execution takes so long, we haven't run it again.
So far, I have only focused on removing logic while making sure that the cache failure persists. I haven't yet tried to remove defaults in the tasks etc.
hallowed-mouse-14616
01/13/2023, 4:54 PM
hallowed-mouse-14616
01/13/2023, 5:00 PM
generate_report is the function we want to cache, right? I am pretty sure we check if there are task outputs; if there are none then there is nothing to cache and we skip it.
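A minimal illustration of that point, assuming flytekit's behaviour as described here (the task name is made up): a cached task without outputs produces nothing to write to the catalog, so the cache step is simply skipped.
from flytekit import task


@task(cache=True, cache_version="0.1")
def write_report_to_bucket() -> None:
    ...  # side effects only; no outputs, hence nothing to cache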
cool-lifeguard-49380
01/13/2023, 5:02 PM
cool-lifeguard-49380
01/13/2023, 5:05 PM
generate_report and full_evaluation.
hallowed-mouse-14616
01/13/2023, 5:06 PM
cool-lifeguard-49380
01/13/2023, 5:11 PM
cool-lifeguard-49380
01/13/2023, 5:34 PM
generate_report doesn't have a return value, so that one is ✅. full_evaluation also has no return value -> ✅ 🤦 silly me …
There are two other tasks that do have return values that also showed this behaviour. When trying to construct a minimal working example, I didn't see this at all today though.
The screenshot with the task search shows what I called try_params in the "minimal working example" (renamed it to avoid internal lingo):
@task(
cache=True,
cache_version="0.1",
requests=Resources(gpu="1", cpu="11", mem="44G"),
disable_deck=False,
)
def try_params(
...
) -> str:
if _is_distributed_worker():
raise flytekit.core.base_task.IgnoreOutputs()
else:
return wandb.run.path
The if statement makes no sense here since this is not a PyTorch task, but it should never be True anyway, since it checks whether RANK is set and != 0, which is never the case for plain Python tasks.
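Based on that description, the helper presumably looks something like the following (a hedged reconstruction, not the actual implementation):
import os


def _is_distributed_worker() -> bool:
    # Treat the process as a distributed worker only when RANK is set and
    # non-zero; plain Python tasks never set RANK, so this returns False.
    rank = os.environ.get("RANK")
    return rank is not None and rank != "0"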
For this task, the screenshot shows two instances running in parallel: one time it worked, one time it didn't.
Then there is the train task:
@task(
cache=True,
cache_version="0.1",
task_config=PyTorch(
num_workers=3,
),
requests=Resources(gpu="1", cpu="11", mem="44G"),
disable_deck=False,
)
def train(
....
) -> (str, str):
if _is_distributed_worker():
raise flytekit.core.base_task.IgnoreOutputs()
else:
return wandb.run.path, out_dir
cool-lifeguard-49380
01/13/2023, 5:34 PM
cool-lifeguard-49380
01/13/2023, 5:41 PM
flytekit.core.base_task.IgnoreOutputs()?
If one worker with RANK != 0 finishes first and returns nothing, could this lead to the cache being deactivated?
hallowed-mouse-14616
01/13/2023, 5:50 PM
cool-lifeguard-49380
01/17/2023, 1:34 PM
hallowed-mouse-14616
01/17/2023, 1:39 PM
IgnoreOutputs stuff - propeller should fail the task if it tries to cache it and the outputs were ignored, so that cannot be it. The next thing was that propeller transparently fails caching. So exactly what you are seeing: the caching mechanism runs into a failure and propeller will still mark the task as succeeded but just not cached. We should really make this more observable. All of the "Failed to write result to catalog for task" messages indicate this is what is happening. I will dive into this.
cool-lifeguard-49380
01/17/2023, 1:42 PM
IgnoreOutputs by intentionally delaying RANK 0 (or every worker but RANK 0) to trigger a potential race condition. But the correct return value from RANK 0 was always retrieved…
cool-lifeguard-49380
01/17/2023, 1:43 PM
hallowed-mouse-14616
01/17/2023, 1:49 PM
cool-lifeguard-49380
01/17/2023, 1:50 PM
remote.execute. The execution id of the run was fc3e15e42c6ec4043b46-sbs-9. Our execution names are generated similarly to what flytekit does, only that we allow a user-defined prefix + uid:
uuid_len = 20
value = value + "-" + uuid.uuid4().hex[:uuid_len]
cool-lifeguard-49380
01/17/2023, 1:52 PM
cool-lifeguard-49380
01/17/2023, 1:53 PM
> Yeah, so this is an error in the datacatalog. I know exactly where it happens - just need to figure out how to repro.
If there is anything I can help with, I'm happy to search for more stuff in the logs or try to execute stuff. Just ping
hallowed-mouse-14616
01/17/2023, 1:57 PM
DatasetKey instances printed off that you have blurred out (reasonably so). I don't necessarily care about the values, but are the fields Project, Domain, Name, and Version? There is no UUID, correct?
cool-lifeguard-49380
01/17/2023, 1:57 PM
hallowed-mouse-14616
01/17/2023, 1:58 PM
cool-lifeguard-49380
01/17/2023, 2:02 PM
{"json":{…}, "level":"error", "msg":"Failed to create dataset model: &{BaseModel:{CreatedAt:0001-01-01 00:00:00 +0000 UTC UpdatedAt:0001-01-01 00:00:00 +0000 UTC DeletedAt:<nil>} DatasetKey:{Project:object_detection Name:flyte_task-<package name>.applications.<application_name>.<dataset_name>.train_workflow.train_stage Domain:development Version:0.1-NGVJxIhX-egfCQQnT UUID:} SerializedMetadata:[10 67 10 12 116 97 115 107 45 118 101 114 115 105 111 110 18 51 108 117 107 97 115 45 102 101 97 116 45 107 105 116 116 105 95 99 98 100 49 98 102 56 48 95 50 48 50 51 45 48 49 45 49 54 95 48 54 45 53 52 45 50 51 95 100 105 114 116 121] PartitionKeys:[]} err: unexpected error type for: write tcp 10.52.1.3:58112->10.22.0.6:5432: write: broken pipe", "ts":"2023-01-17T01:28:18Z"}
cool-lifeguard-49380
01/17/2023, 2:03 PM
<dataset name>. I explicitly did not change the part where it says UUID:}
hallowed-mouse-14616
01/17/2023, 2:04 PM
cool-lifeguard-49380
01/17/2023, 2:05 PM
image: cr.flyte.org/flyteorg/datacatalog-release:v1.2.1
cool-lifeguard-49380
01/17/2023, 2:05 PM
hallowed-mouse-14616
01/17/2023, 2:35 PM
hallowed-mouse-14616
01/17/2023, 2:37 PM
h.db.Create call to check if the item already exists or not. It seems the error that is returned from that call is not an "AlreadyExists" error, so we identify it as something more serious.
hallowed-mouse-14616
01/17/2023, 2:41 PM
from flytekit import task, workflow

@task(cache=True, cache_version="1.0")
def hello_world(name: str) -> str:
    return f"hello {name}"

@workflow
def hello_world_wf(name: str) -> str:
    return hello_world(name=name)

and try calling it with different values, ie: foo, bar, and see what the behavior is. I fear that the call to bar will have the same result as here; namely, GORM isn't detecting the dataset "AlreadyExists" and the cache put fails and bar is not cached.
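One way to run that experiment would be via FlyteRemote, roughly as sketched below (a hedged sketch; project, domain and version are placeholders, not values from this thread):
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Assumes a client already configured for the cluster in question.
remote = FlyteRemote(
    Config.auto(),
    default_project="flytesnacks",
    default_domain="development",
)
wf = remote.fetch_workflow(name="hello_world_wf", version="v1")
for name in ("foo", "bar"):
    # If the cache put silently fails, the "bar" run would show up as
    # succeeded but not cached, mirroring the behaviour discussed above.
    remote.execute(wf, inputs={"name": name}, wait=True)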
hallowed-mouse-14616
01/17/2023, 2:43 PM
cool-lifeguard-49380
01/17/2023, 2:48 PM
> OK - so in the error you sent above here this "broken pipe" is very suspect. In the doc you mentioned that this comes from dataset:36, is that here?
About the "broken pipe", which I agree is suspect: I just checked the CloudSQL machine, it is of a rather weak type since so far it didn't receive much load. Do you think that a slow or failed response could be misinterpreted here? Whether dataset:36 is where you linked I can't say for sure. The repo doesn't have the 1.2.1 tag that matches the image I'm running.
cool-lifeguard-49380
01/17/2023, 2:48 PM
hallowed-mouse-14616
01/17/2023, 2:59 PM
cool-lifeguard-49380
01/17/2023, 3:00 PM
cool-lifeguard-49380
01/17/2023, 3:00 PM
cool-lifeguard-49380
01/17/2023, 3:03 PM
cool-lifeguard-49380
01/17/2023, 3:03 PM
cool-lifeguard-49380
01/17/2023, 3:05 PM
cool-lifeguard-49380
01/17/2023, 3:06 PM
hallowed-mouse-14616
01/17/2023, 3:06 PM
cool-lifeguard-49380
01/17/2023, 3:07 PM
cool-lifeguard-49380
01/17/2023, 3:16 PM
hallowed-mouse-14616
01/17/2023, 3:18 PM
hallowed-mouse-14616
01/17/2023, 3:20 PM
hallowed-mouse-14616
01/17/2023, 4:18 PM
{"json":{…}, "level":"error", "msg":"Failed to create dataset model: &{BaseModel:{CreatedAt:0001-01-01 00:00:00 +0000 UTC UpdatedAt:0001-01-01 00:00:00 +0000 UTC DeletedAt:<nil>} DatasetKey:{Project:object_detection Name:flyte_task-<package name>.applications.<application_name>.<dataset_name>.train_workflow.train_stage Domain:development Version:0.1-NGVJxIhX-egfCQQnT UUID:} SerializedMetadata:[10 67 10 12 116 97 115 107 45 118 101 114 115 105 111 110 18 51 108 117 107 97 115 45 102 101 97 116 45 107 105 116 116 105 95 99 98 100 49 98 102 56 48 95 50 48 50 51 45 48 49 45 49 54 95 48 54 45 53 52 45 50 51 95 100 105 114 116 121] PartitionKeys:[]} err: unexpected error type for: write tcp 10.52.1.3:58112->10.22.0.6:5432: write: broken pipe", "ts":"2023-01-17T01:28:18Z"}
other times we are receiving the error from here:
Dataset already exists key: id:<project:"sandbox" name:"flyte_task-workflow.hello_world" domain:"development" version:"1.0-MjvydOS6-zCZxwZgs" > metadata:<key_map:<key:"task-version" value:"hwuiQzzwpRF386ypXM2SYQ==" > > , err value with matching already exists (duplicate key value violates unique constraint "datasets_pkey")
These occur when propeller is initially trying to create the dataset to mark it as cached. Rather than looking up the dataset to see if it exists, it attempts to create a new one and detects the "AlreadyExists" error. In the former case, the error does not show that the dataset already exists and propeller fails the cache put.
hallowed-mouse-14616
01/17/2023, 4:20 PM
cool-lifeguard-49380
01/17/2023, 4:42 PM
Failed to create dataset. In the 9 days the current pod has been up, there have been 12 occurrences. All 12 occurrences happened in the "actual" workflow built and executed by the ML engineer (who ran this less than a handful of times), where the train_stage and param_search tasks that show this behaviour take at least 1h but, depending on the config, up to 3-5h.
I copied the engineer's workflow to try to create a minimal working example. The signature of the tasks and the structure of the workflow remained the same, but the tasks took only a few seconds to complete. This workflow I ran probably dozens of times.
Since the error only happened in the original long-running workflow, which was executed far less often, I wonder whether there might be a connection somewhere that is kept open for a long time and might then fail due to GCP network errors 🤔
Basically: I wasn't able to reproduce the result because my "minimal working example" didn't take long enough.
hallowed-mouse-14616
01/17/2023, 4:43 PM
> I wasn't able to reproduce the result because my "minimal working example" didn't take long enough.
exactly what I was thinking
cool-lifeguard-49380
01/17/2023, 4:47 PM
cool-lifeguard-49380
01/17/2023, 6:15 PM
time.sleep(1.5h) instead of doing actual training to not waste GPU time.
cool-lifeguard-49380
01/17/2023, 6:15 PM
hallowed-mouse-14616
01/17/2023, 6:18 PM
SetConnMaxLifetime, SetConnMaxIdleTime, but I'm not sure these will help. Let's see what your test comes back with - if we can reproduce it we can add configuration for the aforementioned options and run some more tests. Does that sound reasonable?
high-accountant-32689
01/17/2023, 6:52 PM
cool-lifeguard-49380
01/17/2023, 7:14 PM
cool-lifeguard-49380
01/17/2023, 7:15 PM
> Let's see what your test comes back with - if we can reproduce it we can add configuration for the aforementioned options and run some more tests. Does that sound reasonable?
Would propose exactly the same. If it doesn't reproduce the error, I'll give the postgres instance more resources to see if this resolves it 🤷
cool-lifeguard-49380
01/17/2023, 7:20 PM
hallowed-mouse-14616
01/17/2023, 8:07 PM
cool-lifeguard-49380
01/18/2023, 2:35 PM
With time.sleep(7200) I was able to make one of the tasks fail the cache put. Then I re-ran the workflow, this task ran again, and again failed the cache put.
I will now use a better machine for the cloud sql instance and retry to see if this goes away.
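A rough reconstruction of that repro, under the assumptions discussed in this thread (task and workflow names are made up): cached tasks that only sleep for ~2h, so an idle datacatalog DB connection has time to be dropped before the cache put happens at the end of the task.
import time

from flytekit import task, workflow


@task(cache=True, cache_version="0.1")
def fake_train(seconds: int) -> str:
    time.sleep(seconds)  # stand-in for a multi-hour training run
    return f"slept {seconds}s"


@workflow
def cache_put_repro_wf(seconds: int = 7200) -> str:
    return fake_train(seconds=seconds)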
cool-lifeguard-49380
01/18/2023, 2:35 PM
freezing-airport-6809
cool-lifeguard-49380
01/18/2023, 3:36 PM
freezing-airport-6809
cool-lifeguard-49380
01/18/2023, 3:38 PM
cool-lifeguard-49380
01/18/2023, 3:39 PM
cool-lifeguard-49380
01/18/2023, 3:39 PM
cool-lifeguard-49380
01/18/2023, 3:40 PM
hallowed-mouse-14616
01/18/2023, 4:37 PM
SetConnMaxTimeout, SetConnMaxIdle, etc. parameters on the DB connection to allow long-idle connections from the client-side
3. hack some kind of periodic "ping" service in datacatalog to ensure connections are not idle for long periods
Do you have any other ideas?
cool-lifeguard-49380
01/18/2023, 4:44 PM
Running the time.sleep(2h) workflow with the better db instance now, will report back.
cool-lifeguard-49380
01/20/2023, 2:40 PM
cool-lifeguard-49380
01/20/2023, 2:40 PM
cool-lifeguard-49380
01/20/2023, 2:42 PM
cool-lifeguard-49380
01/20/2023, 2:42 PM
cool-lifeguard-49380
01/20/2023, 2:44 PM
cool-lifeguard-49380
01/20/2023, 2:44 PM
cool-lifeguard-49380
01/20/2023, 2:45 PM
cool-lifeguard-49380
01/20/2023, 2:52 PM
> 1. use the SetConnMaxTimeout, SetConnMaxIdle, etc. parameters on the DB connection to allow long-idle connections from the client-side
> 2. hack some kind of periodic "ping" service in datacatalog to ensure connections are not idle for long periods
We had our mlops team weekly today. We decided that next week we will try to replace the google managed cloudsql database with one that is running in a StatefulSet in the cluster itself. My team lead explained that he had some reliability issues with a Google-managed in-memory redis store a few months back and switched to one he is running himself on GCE. I think it's worth a try before we tackle the points you mentioned. Would you be willing to run e.g. the double loop workflow in your infra @hallowed-mouse-14616? Unfortunately it sleeps forever and wastes resources. But I think it would be valuable to get the feedback that it in fact does run smoothly in other infrastructures.
freezing-airport-6809
cool-lifeguard-49380
01/20/2023, 2:57 PM
freezing-airport-6809
cool-lifeguard-49380
01/21/2023, 10:47 AM
hallowed-mouse-14616
01/21/2023, 2:28 PM
hallowed-mouse-14616
01/23/2023, 5:15 PM
cool-lifeguard-49380
01/23/2023, 5:19 PM
cool-lifeguard-49380
01/23/2023, 5:22 PM
hallowed-mouse-14616
01/23/2023, 5:31 PM
cool-lifeguard-49380
01/24/2023, 5:26 PM
freezing-airport-6809
hallowed-mouse-14616
01/24/2023, 5:28 PM
cool-lifeguard-49380
01/24/2023, 5:29 PM
freezing-airport-6809
freezing-airport-6809
hallowed-mouse-14616
01/24/2023, 5:32 PM
PUT_FAILURE. Would have to look through the code to understand where / if a retry makes sense here.
hallowed-mouse-14616
01/24/2023, 5:33 PM
cool-lifeguard-49380
01/24/2023, 5:37 PM
db connection dropped is encountered?
hallowed-mouse-14616
01/24/2023, 5:38 PM
cool-lifeguard-49380
01/24/2023, 5:40 PM
freezing-airport-6809
cool-lifeguard-49380
01/24/2023, 5:49 PM
freezing-airport-6809
cool-lifeguard-49380
01/24/2023, 5:50 PM
hallowed-mouse-14616
01/24/2023, 5:51 PM
acceptable-policeman-57188
hallowed-mouse-14616
01/24/2023, 6:01 PM
SetConnMaxLifetime, etc. configuration options we had proposed adding earlier). Regardless of whether we need it for this solution or not, we should have it enabled in datacatalog.
hallowed-mouse-14616
01/25/2023, 12:20 AM