Hi folks.. seeing something odd.. I have a map tas...
# ask-the-community
r
Hi folks.. seeing something odd.. I have a map task. Apparently all of them have succeeded but the outer task is still running… not quite sure why
s
seems like a bug. would you mind filing an issue?
[flyte-bug]
d
@Rupsha Chaudhuri has this since completed? maptask has a post-processing phase where outputs from each subnode execution are validated and aggregated. For large fanout (in your case 4589) this means that there are ~5k sequential blobstore reads once all subnodes have completed, which can take a little while. In ArrayNode this I/O is parallelized to be more efficient.
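To make the sequential vs. parallel point concrete, here is a minimal sketch in plain Python. It is not Flyte's or propeller's actual code: `read_output` is a hypothetical stand-in for one blobstore read, and the 10 ms latency and 32 workers are assumed values chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_output(index: int) -> str:
    """Hypothetical stand-in for reading one subnode's output from the blobstore."""
    time.sleep(0.01)  # simulated per-read latency (assumed value)
    return "SUCCESS"


def aggregate_sequential(fanout: int) -> List[str]:
    # One read at a time: total time grows linearly with the fanout.
    return [read_output(i) for i in range(fanout)]


def aggregate_parallel(fanout: int, workers: int = 32) -> List[str]:
    # Many reads in flight at once; pool.map keeps results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_output, range(fanout)))


if __name__ == "__main__":
    fanout = 200  # small illustrative fanout, not the 4589 from the thread
    for fn in (aggregate_sequential, aggregate_parallel):
        start = time.perf_counter()
        outputs = fn(fanout)
        elapsed = time.perf_counter() - start
        print(f"{fn.__name__}: {len(outputs)} outputs in {elapsed:.2f}s")
```

Both functions return the same list; only the wall-clock time differs, which is why the post-processing phase is far less noticeable when the reads overlap.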
k
Can you also please use experimental.map_task
r
I killed the last one after 4 hours.. and the rerun went through. That same task with same input data took 1 hr 45 min
@Dan Rammer (hamersaw) thanks for the explanation
Hey there.. I’ve run into this issue again.. large fanout of 2600 map tasks. What’s this ArrayNode again?
k
Array node is a replacement of map tasks
It allows for better optimizations
What’s the issue
r
exact same problem.. all map tasks apparently succeeded but the outer task has been stuck for hours
k
It could be collecting all data and failing
You probably need higher read/write limits; check the logs
Outside right now
r
no worries.. we can sync on Monday
each task returns a string “SUCCESS”…
and recovering the workflow succeeded in a matter of few minutes
k
Interesting
r
Same issue again.. Is there any documentation/examples on how to use array nodes? Do I just replace the existing map_task with
from flytekit.experimental import map_task
Does it need a min flytekit version?
Is there sample code? When I replaced the map_task with this, the subsequent coalesce steps fail in the publish and register step when I try to deploy the change to our dev namespace. Error seen: “Parameter not bound” for the List param
and the map_task nodes throw this error:
Description: Value required [Target]
Description: Variable [o0] not found on node [n...]
k
are you using experimental?
cc @Dan Rammer (hamersaw) do you know about this one? @Rupsha Chaudhuri what version of the backend and flytekit are you running?
r
yes.. I’m now trying experimental
I was on 1.4… just updated to 1.9 to get flytekit.experimental
Pseudocode
outputs = map_task(mytask, concurrency=50)(input=inputs)
combined_output = my_coalesce(outputs=outputs)
Getting an error with the 2nd task for the outputs parameter
@Dan Rammer (hamersaw) any idea?
k
why 1.9? why not 1.11?
r
I arbitrarily picked 1.9 because I saw some doc that mentioned support was added there.. I can bump up to 1.11
same problem with 1.11 as well
This means I have to update flytectl, correct?
Installed
flyteorg/flytectl info found version: 0.8.18 for v0.8.18/Linux/x86_64
No luck
ok.. backend.. let me check that
flyteadmin, flytescheduler: v1.1.46
flytepropeller: v1.1.44
flyteconsole: 1.3.4
Which one needs to be bumped up?
and to what version?
this is impacting all our runs of a critical workflow after we bumped the data (and map task fan out) up.. any recommendation for which version I should be updating to?
I think the problem is the flytepropeller is crashing (OOM)… fixing that might resolve this issue. But would be good to use the new map task anyways
k
@Rupsha Chaudhuri how much memory are you running with
How big is the output
r
flytepropeller:
  enabled: true
  replicaCount: 2
  image:
    repository: cr.flyte.org/flyteorg/flytepropeller
    tag: v1.1.44
    pullPolicy: IfNotPresent
  serviceAccount:
    create: true
    annotations:
      xxx
  resources:
    limits:
      cpu: 200m
      ephemeral-storage: 100Mi
      memory: 200Mi
    requests:
      cpu: 10m
      ephemeral-storage: 50Mi
      memory: 100Mi
k
What? That is nothing
r
lol I know
k
Please increase cpu and memory
r
not sure why it was set this low
yup.. doing that
k
Not sure how it was working till now
Man Flyte is really dogged
r
interesting that it didn’t crash all this while given how much I abuse Flyte with map tasks
k
You need to write us a good case study
You were running on a raspberry pi it seems
@Rupsha Chaudhuri is it working fine now?
r
yeah.. it’s not crashing any more. I’ll let you know in a bit if the original issue here is resolved
k
cool
give it a few gbs
r
ok.. that map task issue seems to have gone away.. I’ll keep an eye on it though
d
thanks for your patience here, i've been spread a little thin. sounds like this is all resolved (which is great!). probably what is happening is that maptasks aggregate all of the outputs once they're done, so data from all the subnodes is pulled into memory and then written out. if this gets large enough, especially in conjunction with other running tasks, it OOM kills propeller. propeller then comes back up and attempts to perform the maptask again.
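A toy model of that restart loop, as a sketch only: this is not propeller's code, and `simulate_aggregation`, `bytes_per_output` and the numbers passed in are hypothetical. It only illustrates why an aggregation that exceeds the memory limit once will exceed it on every retry, so the outer task looks stuck until the limit is raised.

```python
from typing import Optional


def simulate_aggregation(
    fanout: int,
    bytes_per_output: int,
    memory_limit: int,
    max_restarts: int = 3,
) -> Optional[int]:
    """Return the attempt number on which aggregation completes, or None.

    Every attempt starts from scratch, mirroring a process that is OOM killed,
    restarted, and then retries the same aggregation.
    """
    for attempt in range(1, max_restarts + 1):
        used = 0
        for _ in range(fanout):
            used += bytes_per_output  # each output is held in memory
            if used > memory_limit:
                break  # simulated OOM kill; the next attempt restarts at zero
        else:
            return attempt  # every output fit, aggregation finished
    return None  # same failure each time, so the task never completes


if __name__ == "__main__":
    mib = 1024 * 1024
    # Hypothetical per-output footprint of 100 KiB, fanout from the thread.
    print(simulate_aggregation(2600, 100 * 1024, 200 * mib))
    print(simulate_aggregation(2600, 100 * 1024, 2048 * mib))
```

Because the model is deterministic, retries alone never help; only a smaller footprint or a higher limit changes the outcome.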
r
meanwhile if I were to use experimental.map_task, what version of backend would I need?
d
i would update to latest. there have been a few minor bug fixes to smooth out some of the rough edges.