# flytekit
#4313 [BUG] ShellTask deadlocks when child process produces large stderr

Issue created by v

I noticed that a Flyte ShellTask was stuck in a deadlock because its child process was writing a large amount of data to stderr. The bug appears to be in this code snippet: https://github.com/flyteorg/flytekit/blob/f16ac4910043a56de235d8dc1383996b6ddd13ef/flytekit/extras/tasks/shell.py#L102-L123

This code redirects the child process's stdout and stderr to pipes, and it does not read anything from stderr until the child has finished. As a result, if the child writes enough data to stderr to fill the pipe, neither process can make progress: the parent is waiting for the child to finish, and the child is blocked until something reads from its stderr pipe and drains it.

A simple repro can be implemented as follows. Create two files:
# main.py

import subprocess
import typing

# copy-paste from <https://github.com/flyteorg/flytekit/blob/f16ac4910043a56de235d8dc1383996b6ddd13ef/flytekit/extras/tasks/shell.py#L102-L123>
def _run_script(script) -> typing.Tuple[int, str, str]:
    process = subprocess.Popen(script, stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=0, shell=True, text=True)

    out = ""
    for line in process.stdout:
        print(line)
        out += line

    code = process.wait()
    return code, out, process.stderr.read()

print(_run_script("python error_creator.py"))


# error_creator.py
import sys

for i in range(200000):
    sys.stderr.write("This is an error message\n")

print("This is the output of the program")
Notice that running `python error_creator.py` on its own finishes instantly, but running `python main.py` hangs. If you reduce the number of iterations in `error_creator.py` to 2000, you'll see that `python main.py` finishes instantly too.

I originally found this issue in a production Flyte deployment, and used `strace` and `lldb` to verify that the issue is caused by the pipe filling up. Manually reading from the pipe got rid of the deadlock in my case. I ran a command like:
cat /proc/12484/fd/2
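For illustration, here is a minimal sketch of one way the repro's `_run_script` could avoid the deadlock: a background thread drains stderr while the main thread streams stdout, so neither pipe can fill up. The function name `run_script_draining` and the helper `drain_stderr` are hypothetical, and this is not necessarily how flytekit should or does fix the issue.

```python
import subprocess
import threading
import typing


def run_script_draining(script: str) -> typing.Tuple[int, str, str]:
    """Hypothetical variant of _run_script that reads stderr concurrently."""
    process = subprocess.Popen(
        script,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        shell=True,
        text=True,
    )

    err_lines: typing.List[str] = []

    def drain_stderr() -> None:
        # Read stderr until EOF so the child never blocks on a full pipe.
        for line in process.stderr:
            err_lines.append(line)

    reader = threading.Thread(target=drain_stderr, daemon=True)
    reader.start()

    # Stream stdout line by line, as the original snippet does.
    out = ""
    for line in process.stdout:
        print(line, end="")
        out += line

    code = process.wait()
    reader.join()  # stderr reaches EOF once the child has exited
    return code, out, "".join(err_lines)
```

Calling `run_script_draining("python error_creator.py")` in place of `_run_script` should let the 200000-iteration repro finish. If live streaming of stdout is not needed, `process.communicate()` is the simpler option: the Python `subprocess` documentation warns that `wait()` can deadlock with `stdout=PIPE` or `stderr=PIPE` and recommends `communicate()` for that case.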
flyteorg/flyte