Joe Kelly
11/08/2023, 4:41 PM
I'm running into an issue with datacatalog that is preventing my task caching from working properly. I have a task that, up until yesterday, was caching correctly, but it is now processing all inputs every time it runs rather than respecting previously cached results.
I looked into some of our logs and I'm seeing this error repeatedly, which seems likely to be related:
2023/11/08 16:37:57 /go/src/github.com/flyteorg/datacatalog/pkg/repositories/gormimpl/dataset.go:36 ERROR: duplicate key value violates unique constraint "datasets_pkey" (SQLSTATE 23505)
[2.170ms] [rows:0] INSERT INTO "datasets" ("created_at","updated_at","deleted_at","project","name","domain","version","uuid","serialized_metadata") VALUES ('2023-11-08 16:37:57.613','2023-11-08 16:37:57.612',NULL,'flyte-data','flyte_task-common.task.map_download_file_d1eae65e5508fa39aeffb848a5de666e','production','0.1-Y6BHh9Nf-bd4EAIwD','0a35603a-87be-4009-8967-b2d94bc911b7','<binary>')
{"json":{},"level":"warning","msg":"Dataset already exists key: id:\u003cproject:\"flyte-data\" name:\"flyte_task-common.task.map_download_file_d1eae65e5508fa39aeffb848a5de666e\" domain:\"production\" version:\"0.1-Y6BHh9Nf-bd4EAIwD\" \u003e metadata:\u003ckey_map:\u003ckey:\"task-version\" value:\"a060652d7126ee328e25444c3afb5a09e87b0540\" \u003e \u003e , err value with matching already exists (duplicate key value violates unique constraint \"datasets_pkey\")","ts":"2023-11-08T16:37:57Z"}
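For context on the log above: the duplicate-key error looks like datacatalog's normal "create dataset if missing" path, where the service attempts the INSERT and, on a unique-constraint violation, downgrades it to the "Dataset already exists" warning. A minimal conceptual sketch of that insert-then-tolerate-duplicate pattern (illustrative only, using Python/sqlite3 — the real service is Go/gorm against Postgres):

```python
import sqlite3

# Sketch of the "create dataset" path: try the INSERT, and treat a
# unique-constraint violation as "dataset already exists" rather than
# as a fatal error.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets ("
    "project TEXT, name TEXT, domain TEXT, version TEXT, "
    "PRIMARY KEY (project, name, domain, version))"
)

def create_dataset_if_missing(key):
    try:
        conn.execute("INSERT INTO datasets VALUES (?, ?, ?, ?)", key)
        return "created"
    except sqlite3.IntegrityError:
        # Corresponds to the 'Dataset already exists' warning in the logs.
        return "already exists"

key = ("flyte-data", "flyte_task-...", "production", "0.1-...")
print(create_dataset_if_missing(key))  # created
print(create_dataset_if_missing(key))  # already exists
```

If this is what is happening, the SQLSTATE 23505 line on its own is expected noise on re-registration rather than the cause of the cache misses.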
Has anyone seen this error before / know how to resolve it?

Joe Kelly
11/09/2023, 1:02 AM
I'm not sure this datacatalog error is related to the weird caching behavior we're observing, as I'm seeing this error pop up for two different task types and I'm only observing the lack of caching on one of those two.
To describe the caching issue we're having in more detail:
• We have a workflow that calls download_file(DownloadableDocument) through a map_task, and downstream of that calls parse_file(ParseableDocument) through a map_task.
• Prior to this issue, both classes were set up as a dataclass, and the caching for both seemed to work as expected.
• We changed the class definition of ParseableDocument to define a TypeTransformer so that we could set a specific hash. The caching on ParseableDocument is still working properly, but we now never see a cache hit on the DownloadableDocument task (download_file). That task outputs the class we modified, but that seems like it should be irrelevant when deciding whether or not to run the given task.
• I even tried setting a TypeTransformer for DownloadableDocument as well, to explicitly set its hash, but even with that I am seeing a cache miss every time.
Any help would be appreciated, thanks!

Samhita Alla
Are you using a custom TypeTransformer for DownloadableDocument? Can you just set it to a dataclass and check if the error disappears?

Samhita Alla
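For reference on the dataclass suggestion: with a plain dataclass, two equal instances serialize to byte-identical JSON, so any hash derived from the serialized literal is stable run to run. A stdlib-only sketch of that property (illustrative; the class name mirrors the one in this thread, and the hashing here is hypothetical, not flytekit's internal code):

```python
import dataclasses
import hashlib
import json

# Sketch: equal dataclass instances serialize identically, so a hash of
# the serialized literal is deterministic across runs.
@dataclasses.dataclass
class DownloadableDocument:
    url: str
    name: str

def literal_hash(doc):
    blob = json.dumps(dataclasses.asdict(doc), sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

a = DownloadableDocument(url="s3://bucket/file", name="file")
b = DownloadableDocument(url="s3://bucket/file", name="file")
print(literal_hash(a) == literal_hash(b))  # True
```

If the dataclass route produces stable hashes but still misses the cache, the problem is more likely in how the cache key is formed than in the value hashing.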
Joe Kelly
11/09/2023, 4:39 PM
The behavior is the same whether I have a TypeTransformer[DownloadableDocument] or if I just set it as a dataclass.
I've confirmed that in the case where I do have a TypeTransformer[DownloadableDocument], even setting my custom hash for DownloadableDocument to a constant (so everything beyond the first run should be a cache hit), I only get cache misses.

Samhita Alla
Slackbot
11/10/2023, 6:19 AM
Dan Rammer (hamersaw)
11/10/2023, 1:04 PM
Joe Kelly
11/10/2023, 4:54 PM
Joe Kelly
11/10/2023, 7:27 PM
Yee