# ask-the-community
j
Hi all, I'm running into what seems to be an issue with `datacatalog` that is preventing my task caching from working properly. I have a task that up until yesterday was working properly in terms of caching behavior, but is now processing all inputs every time it runs (rather than respecting previously cached results). I looked into some of our logs and I'm seeing this sort of error repeatedly, which seems likely to be related:
```
2023/11/08 16:37:57 /go/src/github.com/flyteorg/datacatalog/pkg/repositories/gormimpl/dataset.go:36 ERROR: duplicate key value violates unique constraint "datasets_pkey" (SQLSTATE 23505)
[2.170ms] [rows:0] INSERT INTO "datasets" ("created_at","updated_at","deleted_at","project","name","domain","version","uuid","serialized_metadata") VALUES ('2023-11-08 16:37:57.613','2023-11-08 16:37:57.612',NULL,'flyte-data','flyte_task-common.task.map_download_file_d1eae65e5508fa39aeffb848a5de666e','production','0.1-Y6BHh9Nf-bd4EAIwD','0a35603a-87be-4009-8967-b2d94bc911b7','<binary>')
{"json":{},"level":"warning","msg":"Dataset already exists key: id:\u003cproject:\"flyte-data\" name:\"flyte_task-common.task.map_download_file_d1eae65e5508fa39aeffb848a5de666e\" domain:\"production\" version:\"0.1-Y6BHh9Nf-bd4EAIwD\" \u003e metadata:\u003ckey_map:\u003ckey:\"task-version\" value:\"a060652d7126ee328e25444c3afb5a09e87b0540\" \u003e \u003e , err value with matching already exists (duplicate key value violates unique constraint \"datasets_pkey\")","ts":"2023-11-08T16:37:57Z"}
```
Has anyone seen this error before / know how to resolve it?
I'm actually not sure whether this `datacatalog` error is related to the weird caching behavior we're observing, as I'm seeing this error pop up for two different task types and I'm only observing the lack of caching on one of the two. To describe the caching issue we're having in more detail:
• We have a workflow that calls `download_file(DownloadableDocument)` through a `map_task`, and downstream of that calls `parse_file(ParseableDocument)` through a `map_task`.
• Prior to this issue, both classes were set up as a `dataclass` and the caching for both seemed to work as expected.
• We changed the class definition of `ParseableDocument` to define a `TypeTransformer` so that we could set a specific hash; the caching on `ParseableDocument` is still working properly, but now we never see the `DownloadableDocument` task (`download_file`) get a cache hit (this task outputs the class that we modified, but that seems like it should be irrelevant when deciding whether or not to run the given task).
• I even tried setting a `TypeTransformer` for `DownloadableDocument` as well to explicitly set its hash, but even with that I am seeing a cache miss every time.
Any help would be appreciated, thanks!
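[Editor's note: for context on why the output type "seems like it should be irrelevant," cache lookups are generally keyed on the task and its inputs, not its outputs. A toy sketch of that idea, not Flyte's actual implementation; the function and names are illustrative:]

```python
# Toy model of task-level cache-key derivation: the key depends on the
# task identifier, its cache version, and a hash of the *inputs* --
# never the task's outputs.
import hashlib
import json

def cache_key(task_name: str, cache_version: str, inputs: dict) -> str:
    # Serialize inputs deterministically so the same inputs always
    # produce the same key.
    payload = json.dumps(
        {"task": task_name, "version": cache_version, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Same task, same inputs -> same key, regardless of what the task outputs.
k1 = cache_key("download_file", "1.0", {"doc": "s3://bucket/a.pdf"})
k2 = cache_key("download_file", "1.0", {"doc": "s3://bucket/a.pdf"})
assert k1 == k2

# Changing an input (or the cache version) changes the key -> cache miss.
k3 = cache_key("download_file", "1.0", {"doc": "s3://bucket/b.pdf"})
assert k1 != k3
```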
s
Can you not use a `TypeTransformer` for `DownloadableDocument`? Can you just set it to a dataclass and check if the error disappears?
If that's the case, have you double-checked your custom hashing implementation?
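[Editor's note: one common way a custom hash silently breaks caching is when the hash function accidentally covers a field that differs on every run. A hypothetical sketch, with illustrative names, of the pitfall worth double-checking:]

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DownloadableDoc:  # hypothetical stand-in for DownloadableDocument
    url: str
    # Non-deterministic field: differs on every construction.
    fetched_at: str = field(default_factory=lambda: datetime.now().isoformat())

def bad_hash(doc: DownloadableDoc) -> str:
    # Hashing repr() drags in fetched_at, so two logically identical
    # documents hash differently -> every run is a cache miss.
    return hashlib.sha256(repr(doc).encode()).hexdigest()

def good_hash(doc: DownloadableDoc) -> str:
    # Hash only the fields that define the document's identity.
    return hashlib.sha256(doc.url.encode()).hexdigest()

a = DownloadableDoc("s3://bucket/a.pdf", fetched_at="2023-11-08T16:00:00")
b = DownloadableDoc("s3://bucket/a.pdf", fetched_at="2023-11-08T17:00:00")
assert bad_hash(a) != bad_hash(b)    # logically identical, hashes differ
assert good_hash(a) == good_hash(b)  # stable across runs
```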
j
The lack of caching occurs whether I have a `TypeTransformer[DownloadableDocument]` or if I just set it as a dataclass. I've confirmed that in the case where I do have a `TypeTransformer[DownloadableDocument]`, even setting my custom hash for `DownloadableDocument` to a constant (so everything beyond the first run should be a cache hit), I only get cache misses.
s
@Joe Kelly can you create an issue? [flyte-bug]
d
@Joe Kelly do you have a minimal reproducible example? It would really help debug this.
j
I will work on creating a small reproducible example so I can make an effective bug report; thanks all
Turns out that while creating the reproducible example I actually found the solution. I still opened a bug, since I think this silent cache failure is strange behavior for this set of circumstances: https://github.com/flyteorg/flyte/issues/4403
y
thanks for the ticket.
what do you mean by hash?
are you trying to control the hash for the caching?
if so, why does the `HashMethod` annotation not work?
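[Editor's note: `HashMethod` is flytekit's mechanism for attaching a custom hash function to a type via `typing.Annotated`. A toy mimic of how such an annotation can be picked up, not flytekit's actual engine; `hash_value` and the sample data are illustrative:]

```python
import hashlib
from typing import Annotated, get_args, get_origin

class HashMethod:  # stand-in for flytekit's HashMethod wrapper
    def __init__(self, fn):
        self.fn = fn

def hash_doc(doc: dict) -> str:
    # Custom hash covering only the identity-defining field.
    return hashlib.sha256(doc["url"].encode()).hexdigest()

def hash_value(value, annotation) -> str:
    # Prefer a HashMethod found in the Annotated metadata; otherwise
    # fall back to a default hash of the value's repr.
    if get_origin(annotation) is Annotated:
        for meta in get_args(annotation)[1:]:
            if isinstance(meta, HashMethod):
                return meta.fn(value)
    return hashlib.sha256(repr(value).encode()).hexdigest()

DocWithHash = Annotated[dict, HashMethod(hash_doc)]

doc = {"url": "s3://bucket/a.pdf", "fetched_at": "2023-11-08T16:00:00"}
assert hash_value(doc, DocWithHash) == hash_doc(doc)
```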