
Anindya Saha

over 2 years ago
Hi Folks, we need to be able to provide different images for different tasks in a workflow, so I am testing the Multiple Container Images in a Single Workflow feature. I am using the whylogs example. It won't work as-is in a remote cluster, so I added the flytecookbook whylogs container image to each task, which lets the workflow run successfully in a remote cluster with
pyflyte run --remote whylogs_example wf
@task(container_image="ghcr.io/flyteorg/flytecookbook:whylogs_examples-latest")
def get_reference_data() -> pd.DataFrame:
    ...

@task(container_image="ghcr.io/flyteorg/flytecookbook:whylogs_examples-latest")
def get_target_data() -> pd.DataFrame:
    ...

@task(container_image="ghcr.io/flyteorg/flytecookbook:whylogs_examples-latest")
def create_profile_view(df: pd.DataFrame) -> DatasetProfileView:
    ...

@task(container_image="ghcr.io/flyteorg/flytecookbook:whylogs_examples-latest")
def constraints_report(profile_view: DatasetProfileView) -> bool:
    ...
However, get_reference_data and get_target_data should not need whylogs. They only work with pandas and scikit-learn. We should be able to run those tasks with the
@task(container_image="ghcr.io/flyteorg/flytecookbook:core-latest")
image. I did try that, but it fails. The k8s logs say:
File "/opt/venv/lib/python3.8/site-packages/flytekit/core/python_auto_container.py", line 279, in load_task
    task_module = importlib.import_module(name=task_module)  # type: ignore
  ...
  File "/root/whylogs_example.py", line 17, in <module>
    import whylogs as why
ModuleNotFoundError: No module named 'whylogs'
Traceback (most recent call last):
Every task container is trying to parse the entire whylogs_example.py file, and since flytecookbook:core-latest does not have whylogs, it fails. What is the best design pattern or strategy to follow in such cases? How can I make it work remotely? I read containerization/multi_images.html; that example has two methods, svm_trainer and svm_predictor, but both end up using the same image. All the examples I see in https://github.com/flyteorg/flytesnacks/tree/master/cookbook/case_studies/ml_training also have only one custom Dockerfile. Is there a production-grade example workflow whose tasks use images that differ significantly from each other? I am looking for a complex reference workflow that covers these nuances, in particular how to organize the pieces when each task has its own custom image.