Good morning! I'm using a `ShellTask` to `curl + g...
# flyte-support
r
Good morning! I'm using a `ShellTask` to `curl + gunzip` a file. The input filename is formatted like `date.json.gz` and I want to set the output location to `date.json`, that is, to strip off the `.gz` from the `{input.filename}`. I see the `OutputLocation` dataclass allows using a regex: https://github.com/flyteorg/flytekit/blob/f16419136abcf971d30d3398bd7b35a7b6aec904/flytekit/extras/tasks/shell.py#L37 but I can't figure out how that works. Are there any examples? Thanks!
can this help you?
r
No, the text "regex" is not found on that page
a
@ripe-nest-20732 you should be able to use Python native string formatting. So doing something like this should work:
OutputLocation(
    var="output_file",
    var_type=FlyteFile,
    location="{re.sub(r'\.gz$', '', inputs.filename)}"
)
let us know if that helps
r
Just to make sure I understand, it can be an arbitrary Python expression (that is valid for an f-string) inside the braces?
a
That's what I get from this class definition, but I haven't tried it myself, so I could be wrong: https://github.com/flyteorg/flytekit/blob/1ffadb56dbe2d2a160dfe4aa745ba46b4120eda4/flytekit/extras/tasks/shell.py#L132
r
Thanks for the reply David! I'll try it out when I switch back to working on that part of the code and let you know 🙂
Hi @average-finland-92144! It turns out that the only permitted operations are the ones in the Format mini-language: https://docs.python.org/3/library/string.html#formatspec So it's not quite like an f-string, but rather a field in a string being passed to `"".format()`. So there doesn't seem to be a way to do what I'm trying to do, but I realized I can just pass the `filename` without the `.gz` and then add that suffix in my shell script where needed.
If I try your suggestion above, I get:
[0]: Traceback (most recent call last):

      File "/opt/micromamba/envs/runtime/lib/python3.12/site-packages/flytekit/core/base_task.py", line 745, in dispatch_execute
    native_outputs = self.execute(**native_inputs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/micromamba/envs/runtime/lib/python3.12/site-packages/flytekit/core/array_node_map_task.py", line 270, in execute
    return self.python_function_task.execute(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/micromamba/envs/runtime/lib/python3.12/site-packages/flytekit/extras/tasks/shell.py", line 326, in execute
    outputs[v.var] = self._interpolizer.interpolate(v.location, inputs=kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/micromamba/envs/runtime/lib/python3.12/site-packages/flytekit/extras/tasks/shell.py", line 174, in interpolate
    raise ValueError(f"Variable {e} in Query not found in inputs {consolidated_args.keys()}")

Message:

    ValueError: Variable 're' in Query not found in inputs dict_keys(['inputs', 'outputs', 'ctx'])
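The error above is consistent with how `str.format` treats a replacement field: it supports attribute access and indexing on a named argument, but it does not evaluate expressions. A minimal sketch using only the standard library; the `interpolate` helper and the sample filename are hypothetical stand-ins, not flytekit's actual implementation:

```python
from types import SimpleNamespace


def interpolate(template: str, **fields) -> str:
    """Fill a template with str.format; no Python expressions are evaluated."""
    return template.format(**fields)


# Hypothetical stand-in for the task inputs (the real object is built by flytekit).
inputs = SimpleNamespace(filename="2024-01-01.json.gz")

# Attribute access inside the braces is part of the format syntax, so this works.
print(interpolate("{inputs.filename}", inputs=inputs))

# A function call is not part of the syntax: "re" is parsed as a field name
# and looked up among the supplied arguments, which raises KeyError.
try:
    interpolate(r"{re.sub(r'\.gz$', '', inputs.filename)}", inputs=inputs)
except KeyError as err:
    print(f"unknown field: {err}")
```

The `KeyError` for `re` corresponds to the `Variable 're' in Query not found` message in the traceback, which suggests flytekit wraps the same lookup failure in a `ValueError`.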
a
Oh, good to know. So "I can just pass the `filename` without the `.gz` and then add that suffix in my shell script where needed" works for you?
r
Yep!
And actually I even figured out how to stream the data from the HTTP endpoint without needing the shell task at all 😄
a
Oh that's cool. Thanks for sharing
r
🎉
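For reference, the workaround discussed in this thread (pass the filename without `.gz` and append the suffix in the script) can be sketched with plain `str.format`, assuming the script and output-location templates are rendered roughly the way the traceback suggests. The URL, the `output_file` variable name, and the `render` helper are hypothetical placeholders, and this does not call flytekit itself:

```python
from types import SimpleNamespace

# Hypothetical templates: the output location is the bare filename,
# and the ".gz" suffix is appended only where the script downloads the file.
LOCATION_TEMPLATE = "{inputs.filename}"
SCRIPT_TEMPLATE = (
    "curl -sSL https://example.com/{inputs.filename}.gz"
    " | gunzip > {outputs.output_file}"
)


def render(filename: str) -> tuple[str, str]:
    """Return the rendered (output location, shell script) for a filename without .gz."""
    inputs = SimpleNamespace(filename=filename)
    location = LOCATION_TEMPLATE.format(inputs=inputs)
    outputs = SimpleNamespace(output_file=location)
    script = SCRIPT_TEMPLATE.format(inputs=inputs, outputs=outputs)
    return location, script


location, script = render("2024-01-01.json")
print(location)
print(script)
```

Because only attribute access is needed, both templates stay within what `str.format` supports, and no regex is required to derive the output name.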