Eli Bixby
04/17/2023, 11:23 AM
`FlyteFile[Format]` syntax with `StructuredDataset`s? It looks like they are backed by a different proto, so it's not clear to me how that works.
Ketan (kumare3)
Eli Bixby
04/17/2023, 1:34 PM
`FlyteFile[structured_dataset.PARQUET]`?
Ketan (kumare3)
Eli Bixby
04/17/2023, 2:03 PM
`FlyteFile[PyTorchModule]` etc and then creating a flytefile when registering the launch plan, but I'm not sure if there's a way to do this with a `StructuredDataset`
`FlyteFile[structured_dataset.PARQUET]` works fine for inputting as a launch plan parameter, but we can't figure out how to pass the promise to a task that takes a `StructuredDataset` as an input.
`nn.Module` or `np.ndarray`). That doesn't work for `StructuredDataset`. We get an error (will paste when I find it), that the input type doesn't match the expected type.
Kevin Su
04/17/2023, 3:48 PM
`StructuredDataset.uri`
Daniel Danciu
04/17/2023, 6:07 PM
@task
def do_task(a: StructuredDataset) -> int:
    ...
@workflow
def do_workflow(a: FlyteFile[structured_dataset.PARQUET]):
    ...
LaunchPlan.create('PlanB', do_workflow, default_inputs={'a': FlyteFile('<gs://path_to_flyte_parquet_output>')})
And this is the error we are getting:
Error 0: Code: MismatchingTypes, Node Id: n1, Description: Variable [a] (type [blob:<format:"parquet" > ]) doesn't match expected type [structured_dataset_type:<> ].
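One way to picture the mismatch reported above is a toy model, written in plain Python and not using flytekit. It assumes only what the error message itself shows: the workflow input is a blob type with a format, while the task expects a structured dataset type. The class names and `check_binding` are hypothetical and are not Flyte's actual compiler logic.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class BlobType:
    """Hypothetical stand-in for the blob literal type seen in the error."""
    format: str


@dataclass(frozen=True)
class StructuredDatasetType:
    """Hypothetical stand-in for the structured dataset literal type."""
    format: str = ""


ToyLiteralType = Union[BlobType, StructuredDatasetType]


def check_binding(
    var: str, upstream: ToyLiteralType, expected: ToyLiteralType
) -> Optional[str]:
    """Return None when the two toy types are compatible, else a description."""
    # Assumption: different kinds of type never match, even when the
    # format string ("parquet") happens to be the same.
    kinds_differ = type(upstream) is not type(expected)
    # Assumption: within one kind, the formats must be equal (a simplification).
    formats_differ = upstream.format != expected.format
    if kinds_differ or formats_differ:
        return (
            f"Variable [{var}] (type [{upstream}]) "
            f"doesn't match expected type [{expected}]."
        )
    return None


# A parquet blob bound to a structured dataset input is rejected.
print(check_binding("a", BlobType("parquet"), StructuredDatasetType()))
# Two blob types with the same format are accepted (prints None).
print(check_binding("a", BlobType("NumpyArray"), BlobType("NumpyArray")))
```

In this sketch the format string plays no role once the kinds differ, which is consistent with a parquet `FlyteFile` being rejected as input to a task that takes a `StructuredDataset`.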
Using `FlyteFile[NumpyArrayTransformer.NUMPY_ARRAY_FORMAT]` in the workflow and receiving `np.ndarray` in the task works fine.
Yee
Daniel Danciu
04/17/2023, 9:30 PM
default_inputs={'a': StructuredDataset(uri='gs://...', file_format=structured_dataset.PARQUET)}
This runs into the following error:
TypeError: int() argument must be a string, a bytes-like object or a number, not '_NoValueType'
when reading the received parameter in the task using:
a.open(pd.DataFrame).all()
(I declared `a` as being `StructuredDataset` in the task)
@task
def do_task(a: Annotated[StructuredDataset, kwtypes(my_column=float)]) -> int:
    ...
@workflow
def do_workflow(a: StructuredDataset):
    ...
LaunchPlan.create('PlanB', do_workflow, default_inputs={'a': StructuredDataset(uri='gs://...', file_format=PARQUET)})
Obviously one can just use `StructuredDataset` instead of `Annotated[StructuredDataset, kwtypes(my_column=float)]`