# flyte-support
f
Hi y'all! Just working on standing up Flyte for chemoproteomics research on k8s via Amazon EKS. While all the deployments and hello-world-scale demos work cleanly, I'm hitting a weird snag on my first custom deployment (code and Dockerfiles in thread). I keep seeing

```
flyte USER::Pod failed. No message received from kubernetes.
```

and the workflow fails within ~10 seconds of launch (Docker images are prefetched onto the cluster nodes). I'm 75% sure I'm doing something stupid, and I was hoping a second pair of eyes would catch it.
Dockerfile:
```dockerfile
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

# Install wget (for the Miniconda download) and libxrender1
RUN apt-get update && apt-get install -y wget libxrender1 && \
    rm -rf /var/lib/apt/lists/*

# Download and install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /miniconda.sh && \
    bash /miniconda.sh -b -p /miniconda && \
    rm /miniconda.sh

# Add Conda to PATH
ENV PATH="/miniconda/bin:${PATH}"
RUN conda update conda && conda install -n base conda-libmamba-solver && conda config --set solver libmamba
RUN conda install python==3.12 polars pandas numpy sqlalchemy transformers safetensors pytorch==2.5.1 pytorch-cuda=12.4 wandb lightning -c conda-forge -c pytorch -c nvidia/label/cuda-12.4.1
RUN pip install flytekit flytekitplugins-pod flytekitplugins-sqlalchemy flytekitplugins-huggingface \
    torch-cluster -f https://data.pyg.org/whl/torch-2.5.1+cu124.html \
    torch-geometric -f https://data.pyg.org/whl/torch-2.5.1+cu124.html \
    torch-scatter -f https://data.pyg.org/whl/torch-2.5.1+cu124.html \
    torch-sparse -f https://data.pyg.org/whl/torch-2.5.1+cu124.html \
    torch-spline-conv -f https://data.pyg.org/whl/torch-2.5.1+cu124.html

ENTRYPOINT [ "/bin/bash" ]
```
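A quick local smoke test of the built image can catch entrypoint and interpreter problems before anything reaches the cluster (a sketch; `flyte-poc:dev` is a placeholder tag, not a name from the thread):

```shell
# Build locally, then confirm the flytekit console scripts resolve and
# run under the image's real entrypoint handling; a broken interpreter
# setup fails here instead of on the first remote execution.
docker build -t flyte-poc:dev .
docker run --rm --entrypoint pyflyte flyte-poc:dev --version
```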
```python
from flytekit import Resources, task, workflow, PodTemplate
from flytekitplugins.pod import Pod
from src.dataframe_utils import random_dataframe, write_embeddings_s3
from src.protein_utils import calculate_esm_embeddings
import polars as pl
from kubernetes.client.models import V1PodSpec, V1Container, V1ResourceRequirements
import argparse

cpu_container_image = 'ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/flyte_runtime_containers:cpu-prod-2024-12-06'
gpu_container_image = 'ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/flyte_runtime_containers:gpu-prod-2024-12-06'

cpu_pod_template = PodTemplate(
    annotations={"karpenter.sh/do-not-disrupt": "true"},
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                image=cpu_container_image,
                resources=V1ResourceRequirements(
                    limits={'cpu': "1", 'mem': "1Gi", 'ephemeral_storage': '10Gi'}),
            ),
        ],
        node_selector={'karpenter.sh/nodepool': 'cpu-nodepool'},
    ),
)

gpu_pod_template = PodTemplate(
    annotations={"karpenter.sh/do-not-disrupt": "true"},
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                image=gpu_container_image,
                resources=V1ResourceRequirements(
                    limits={'cpu': "1", 'mem': "1Gi", 'ephemeral_storage': '10Gi'}),
            ),
        ],
        node_selector={'karpenter.sh/nodepool': 'cpu-nodepool'},
    ),
)


@task(
    pod_template=cpu_pod_template,
    container_image=cpu_container_image,
    requests=Resources(cpu="1", mem="1Gi", ephemeral_storage='10Gi'),
    limits=Resources(cpu="2", mem="2Gi", ephemeral_storage='10Gi'),
)
def make_dataframe(num_rows: int) -> pl.DataFrame:
    return random_dataframe(num_rows)


@task(
    pod_template=gpu_pod_template,
    container_image=gpu_container_image,
    requests=Resources(cpu="4", mem="8Gi", gpu=1, ephemeral_storage='10Gi'),
    limits=Resources(cpu="8", mem="16Gi", gpu=1, ephemeral_storage='10Gi'),
)
def esm_embed_sequences(sequence_dataframe: pl.DataFrame) -> pl.DataFrame:
    return calculate_esm_embeddings(sequence_dataframe)


@task(
    pod_template=cpu_pod_template,
    container_image=cpu_container_image,
    requests=Resources(cpu="4", mem="8Gi", ephemeral_storage='10Gi'),
    limits=Resources(cpu="8", mem="16Gi", ephemeral_storage='10Gi'),
)
def write_embeddings_df_to_s3(embeddings_df: pl.DataFrame, s3_location: str) -> None:
    write_embeddings_s3(embeddings_df, s3_location)


@workflow
def proof_of_concept_wf(num_rows: int, s3_output_loc: str) -> None:
    seq_df = make_dataframe(num_rows)
    esm_df = esm_embed_sequences(seq_df)
    write_results_to_s3 = write_embeddings_df_to_s3(esm_df, s3_output_loc)
```
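Two things worth flagging in the templates as posted (editorial observations, not necessarily the cause of this failure): Kubernetes only recognizes the resource names `memory` and `ephemeral-storage`, so the `mem` and `ephemeral_storage` keys in the raw `V1ResourceRequirements` won't be understood by the API server. A minimal check:

```python
# Resource names Kubernetes actually accepts in requests/limits;
# "mem" and "ephemeral_storage" (underscore) are not among them.
VALID_RESOURCE_KEYS = {"cpu", "memory", "ephemeral-storage", "nvidia.com/gpu"}

limits = {"cpu": "1", "mem": "1Gi", "ephemeral_storage": "10Gi"}  # as posted
unknown = sorted(set(limits) - VALID_RESOURCE_KEYS)
print(unknown)  # → ['ephemeral_storage', 'mem']
```

Separately, the GPU template's `node_selector` points at `cpu-nodepool`, which looks like a copy-paste slip from the CPU template.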
Main workflow
I'm not going to bother with the code for the individual functions, since execution never even enters them; the crash can't live there.
```
(base) bfrench@LM-BFRENCH:~/Documents/Code/flyte-poc$ kubectl logs -n pipeline-poc-development poc-d612956150614cf0b9f-n0-0
/miniconda/bin/pyflyte-fast-execute: line 3: import: command not found
/miniconda/bin/pyflyte-fast-execute: line 4: import: command not found
/miniconda/bin/pyflyte-fast-execute: line 5: from: command not found
/miniconda/bin/pyflyte-fast-execute: pyflyte-fast-execute: line 7: syntax error near unexpected token `('
/miniconda/bin/pyflyte-fast-execute: pyflyte-fast-execute: line 7: `    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])'
```
Logs from the pod that executes the code
t
Thank you for finding/fixing! Take a look at the ImageSpec options when you have a chance, and use the `default` builder for better caching. ImageSpec should be enough for most basic Dockerfile setups.
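For reference, the ImageSpec route being suggested looks roughly like this (a sketch only: the registry, name, and package list are loosely adapted from the Dockerfile above, not a confirmed working spec):

```python
from flytekit import ImageSpec, task

# Hypothetical ImageSpec mirroring part of the hand-rolled image;
# flytekit builds and pushes it for you, and the "default" builder
# layers packages for better caching between revisions.
runtime_image = ImageSpec(
    name="flyte_runtime_containers",
    registry="ACCOUNT.dkr.ecr.us-east-1.amazonaws.com",
    python_version="3.12",
    packages=["polars", "transformers", "torch==2.5.1", "lightning"],
    builder="default",
)

@task(container_image=runtime_image)
def make_dataframe(num_rows: int): ...
```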
f
I thought about it, but by virtue of the space I work in there's a lot of shell-task stuff with really custom installs, so I wanted to PoC bringing our own Dockerfile, because tools like JackHmmer or GROMACS aren't going to play as nicely with pip/conda-style installs.