https://flyte.org logo
#announcements
Title
# announcements
j

jeev

08/16/2022, 11:41 PM
we're getting collisions with the execution IDs of subworkflow launch plans that propeller is kicking off:
Copy code
Setting the MetadataDir for StartNode [<gs://flyte-freenome-computational/metadata/propeller/balrog-production-fldyczra/start-node/data>]
fldyczra
was already used. given that these are much shorter than what can be used, is there an easy way to make these longer to avoid possible collisions?
the consequence of this is that propeller is passing totally wrong values (derived from the older execution) to downstream tasks, which is potentially very dangerous.
tagging @Anna Cunningham that first discovered this
k

Ketan (kumare3)

08/16/2022, 11:56 PM
@jeev we migrated to 20 character ids recently
j

jeev

08/16/2022, 11:57 PM
@Ketan (kumare3): what version of flytepropeller?
where did you get the subworkflow id of 8 chars?
actually nevermind, this is 3 years ago
j

jeev

08/16/2022, 11:58 PM
we're running:
Copy code
image: <http://cr.flyte.org/flyteorg/flytepropeller:v1.1.15@sha256:6630864d9adc1e6ecd980376b244f826cbee5962a36fb9b7760a840bad70447b|cr.flyte.org/flyteorg/flytepropeller:v1.1.15@sha256:6630864d9adc1e6ecd980376b244f826cbee5962a36fb9b7760a840bad70447b>
I think this is the problem
cc @Dan Rammer (hamersaw)
i dont know how to make it longer?
j

jeev

08/17/2022, 12:06 AM
collisions seem highly unlikely. just bad luck?
k

Ketan (kumare3)

08/17/2022, 1:43 AM
Ya
With 8 characters there is higher likelihood as compared to 20 chars
j

jeev

08/17/2022, 1:44 AM
right
after some major digging, we found the colliding input strings:
fbffih2q-n2-0-n3-0-dn0-0
and
f24nu5ka-n2-0-n3-0-dn4-0
hash to the same value:
fldyczra
fbffih2q-n2-0-n3-0-dn1-0
and
f24nu5ka-n2-0-n3-0-dn5-0
hash to:
fkyq4zuy
so between these 2 executions
fbffih2q
and
f24nu5ka
we had like 7 or so collisions 😞
d

Dan Rammer (hamersaw)

08/17/2022, 12:08 PM
OK, first, thanks for finding this! It can't have been fun to track down. For an immediate fix I think it makes sense to increase the hashing algorithm to 64 bit. Obviously collisions may still occur, so we should probably detect launch plan execution id collisions - this is a bit more involved.
🙏 3
k

Ketan (kumare3)

08/17/2022, 1:46 PM
This is very involved
1
🙏 2
j

jeev

08/17/2022, 3:38 PM
we'll look into other things like object lifecycle policies and memoization ttl to reduce the likelihood of this happening. but ultimately that would really just be skirting around the issue.
but not 100% sure if its ok to lifecycle the propeller metadata
k

Ketan (kumare3)

08/17/2022, 3:40 PM
It is ok to lifecycle
j

jeev

08/17/2022, 3:40 PM
that would probably break getting execution data via remote or the console
no?
k

Ketan (kumare3)

08/17/2022, 3:40 PM
Just delete the db entry too
j

jeev

08/17/2022, 3:41 PM
hmm
we're depending too much on flyte metadata right now, but maybe this is something we should look into
a

Anna Cunningham

08/19/2022, 11:26 PM
any suggestions for how we can work around this issue without making any changes in the DB, or is that the only way? the first time this happened, I just relaunched the workflow after renaming an input file so it wouldn’t try reading from the cache. it happened again with another workflow today, and we’d prefer not to relaunch this workflow, as it’s very large and computationally expensive to rerun
j

jeev

08/19/2022, 11:52 PM
i wonder if this says anything about the hash algorithm itself and it’s usefulness here. should we be worried?
it’s especially dangerous because the collisions we have seen are from similar positions in the DAG. as such, it’s possible that there might be a silent failure.
k

Ketan (kumare3)

08/20/2022, 12:14 AM
Hmm the problem is that the generated strong and the hashing algorithm Fnv are unable to have enough entropy
We should prefix the generated string with the exec id
But then we have to conform
j

jeev

08/20/2022, 12:15 AM
@Anna Cunningham would you be open to coordinating a solution with the flyte team on monday? unfortunately i’ll still be OOO.
k

Ketan (kumare3)

08/20/2022, 12:16 AM
Cc @Haytham Abuelfutuh / @Dan Rammer (hamersaw) this is very disturbing
a

Anna Cunningham

08/20/2022, 12:17 AM
I’m not as familiar with the internals but I can do my best! this is definitely an urgent and serious issue for us
k

Ketan (kumare3)

08/20/2022, 12:18 AM
Yes I get and we are here to support
🙌 2
h

Haytham Abuelfutuh

08/23/2022, 3:01 AM
Apologies for the delay, @Anna Cunningham. Was just made aware of this issue. Is there a github issue for it yet? if not, do you mind filing one? happy to help push a quick fix forward…
a

Anna Cunningham

08/23/2022, 3:03 AM
sure where should I file it? not sure if this should go in
flytekit
repo or elsewhere
h

Haytham Abuelfutuh

08/23/2022, 3:03 AM
a

Anna Cunningham

08/23/2022, 3:16 AM
submitted, but please reach out if you need more information!
h

Haytham Abuelfutuh

08/23/2022, 3:17 AM
Thank you!
m

Matheus Moreno

08/24/2022, 8:00 PM
Hey guys, just a heads up, my coworker just had the same problem. It was also a complex workflow comprised of many subworkflows.
k

Ketan (kumare3)

08/24/2022, 8:02 PM
Ohh no
This fix will do it
Sorry about that
m

Matheus Moreno

08/25/2022, 2:43 PM
What's the current workaround for this? Should we delete the project from our database?
a

Anna Cunningham

08/31/2022, 4:56 PM
any updates on this fix? I see the github issue is still open and hasn’t had much activity
k

Ketan (kumare3)

08/31/2022, 6:40 PM
Cc @Haytham Abuelfutuh
w

will danforth

09/01/2022, 5:07 PM
Thanks @Ketan (kumare3) and @Haytham Abuelfutuh for your help on this
Hey @Ketan (kumare3) and @Haytham Abuelfutuh do you have an update on this?
d

Dan Rammer (hamersaw)

09/09/2022, 7:48 PM
@will danforth, thanks for your patience on this! We merged support for 64bit fnv in flyteplugins and should be merging support for this change (with backwards compatability) in FlytePropeller shortly (ie. hopefully later today).
k

Ketan (kumare3)

09/09/2022, 7:54 PM
@Rahul Mehta this should also give you large names
25 Views