we're getting collisions with the execution IDs of...
# announcements
j
we're getting collisions with the execution IDs of subworkflow launch plans that propeller is kicking off:
Copy code
Setting the MetadataDir for StartNode [<gs://flyte-freenome-computational/metadata/propeller/balrog-production-fldyczra/start-node/data>]
fldyczra
was already used. given that these are much shorter than what can be used, is there an easy way to make these longer to avoid possible collisions?
the consequence of this is that propeller is passing totally wrong values (derived from the older execution) to downstream tasks, which is potentially very dangerous.
tagging @Anna Cunningham that first discovered this
k
@jeev we migrated to 20 character ids recently
j
@Ketan (kumare3): what version of flytepropeller?
where did you get the subworkflow id of 8 chars?
actually nevermind, this is 3 years ago
j
we're running:
Copy code
image: <http://cr.flyte.org/flyteorg/flytepropeller:v1.1.15@sha256:6630864d9adc1e6ecd980376b244f826cbee5962a36fb9b7760a840bad70447b|cr.flyte.org/flyteorg/flytepropeller:v1.1.15@sha256:6630864d9adc1e6ecd980376b244f826cbee5962a36fb9b7760a840bad70447b>
I think this is the problem
cc @Dan Rammer (hamersaw)
i dont know how to make it longer?
j
collisions seem highly unlikely. just bad luck?
k
Ya
With 8 characters there is higher likelihood as compared to 20 chars
j
right
after some major digging, we found the colliding input strings:
fbffih2q-n2-0-n3-0-dn0-0
and
f24nu5ka-n2-0-n3-0-dn4-0
hash to the same value:
fldyczra
fbffih2q-n2-0-n3-0-dn1-0
and
f24nu5ka-n2-0-n3-0-dn5-0
hash to:
fkyq4zuy
so between these 2 executions
fbffih2q
and
f24nu5ka
we had like 7 or so collisions 😞
d
OK, first, thanks for finding this! It can't have been fun to track down. For an immediate fix I think it makes sense to increase the hashing algorithm to 64 bit. Obviously collisions may still occur, so we should probably detect launch plan execution id collisions - this is a bit more involved.
🙏 3
k
This is very involved
1
🙏 2
j
we'll look into other things like object lifecycle policies and memoization ttl to reduce the likelihood of this happening. but ultimately that would really just be skirting around the issue.
but not 100% sure if its ok to lifecycle the propeller metadata
k
It is ok to lifecycle
j
that would probably break getting execution data via remote or the console
no?
k
Just delete the db entry too
j
hmm
we're depending too much on flyte metadata right now, but maybe this is something we should look into
a
any suggestions for how we can work around this issue without making any changes in the DB, or is that the only way? the first time this happened, I just relaunched the workflow after renaming an input file so it wouldn’t try reading from the cache. it happened again with another workflow today, and we’d prefer not to relaunch this workflow, as it’s very large and computationally expensive to rerun
j
i wonder if this says anything about the hash algorithm itself and it’s usefulness here. should we be worried?
it’s especially dangerous because the collisions we have seen are from similar positions in the DAG. as such, it’s possible that there might be a silent failure.
k
Hmm the problem is that the generated strong and the hashing algorithm Fnv are unable to have enough entropy
We should prefix the generated string with the exec id
But then we have to conform
j
@Anna Cunningham would you be open to coordinating a solution with the flyte team on monday? unfortunately i’ll still be OOO.
k
Cc @Haytham Abuelfutuh / @Dan Rammer (hamersaw) this is very disturbing
a
I’m not as familiar with the internals but I can do my best! this is definitely an urgent and serious issue for us
k
Yes I get and we are here to support
🙌 2
h
Apologies for the delay, @Anna Cunningham. Was just made aware of this issue. Is there a github issue for it yet? if not, do you mind filing one? happy to help push a quick fix forward…
a
sure where should I file it? not sure if this should go in
flytekit
repo or elsewhere
h
a
submitted, but please reach out if you need more information!
h
Thank you!
m
Hey guys, just a heads up, my coworker just had the same problem. It was also a complex workflow comprised of many subworkflows.
k
Ohh no
This fix will do it
Sorry about that
m
What's the current workaround for this? Should we delete the project from our database?
a
any updates on this fix? I see the github issue is still open and hasn’t had much activity
k
Cc @Haytham Abuelfutuh
w
Thanks @Ketan (kumare3) and @Haytham Abuelfutuh for your help on this
Hey @Ketan (kumare3) and @Haytham Abuelfutuh do you have an update on this?
d
@will danforth, thanks for your patience on this! We merged support for 64bit fnv in flyteplugins and should be merging support for this change (with backwards compatability) in FlytePropeller shortly (ie. hopefully later today).
k
@Rahul Mehta this should also give you large names
160 Views