freezing-tailor-85994
12/09/2024, 6:49 PM
I'm seeing "flyte USER::Pod failed. No message received from kubernetes." and failure of the workflow within ~10 seconds of launch (docker images are prefetched onto cluster nodes). I'm 75% sure it's me doing something stupid and I was hoping that a 2nd pair of eyes would catch it.
freezing-tailor-85994
12/09/2024, 6:50 PM
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
# Install system prerequisites (wget for the Miniconda download, libxrender1 runtime library)
RUN apt-get update && apt-get install -y wget libxrender1 && \
    rm -rf /var/lib/apt/lists/*
# Download and install Miniconda
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /miniconda.sh && \
    bash /miniconda.sh -b -p /miniconda && \
    rm /miniconda.sh
# Add Conda to PATH
ENV PATH="/miniconda/bin:${PATH}"
RUN conda update conda && conda install -n base conda-libmamba-solver && conda config --set solver libmamba
RUN conda install python==3.12 polars pandas numpy sqlalchemy transformers safetensors pytorch==2.5.1 pytorch-cuda=12.4 wandb lightning -c conda-forge -c pytorch -c nvidia/label/cuda-12.4.1
RUN pip install flytekit flytekitplugins-pod flytekitplugins-sqlalchemy flytekitplugins-huggingface \
    torch-cluster -f https://data.pyg.org/whl/torch-2.5.1+cu124.html \
    torch-geometric -f https://data.pyg.org/whl/torch-2.5.1+cu124.html \
    torch-scatter -f https://data.pyg.org/whl/torch-2.5.1+cu124.html \
    torch-sparse -f https://data.pyg.org/whl/torch-2.5.1+cu124.html \
    torch-spline-conv -f https://data.pyg.org/whl/torch-2.5.1+cu124.html
ENTRYPOINT [ "/bin/bash" ]
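For context, an image like this would presumably be built and pushed to ECR along these lines; ACCOUNT is a placeholder, as in the snippets later in this thread, and the tag is taken from the workflow code:

# Assumed build-and-push flow for the image referenced in the workflow code
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.us-east-1.amazonaws.com
docker build -t ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/flyte_runtime_containers:gpu-prod-2024-12-06 .
docker push ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/flyte_runtime_containers:gpu-prod-2024-12-06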
freezing-tailor-85994
12/09/2024, 6:51 PM
from flytekit import Resources, task, workflow, PodTemplate
from flytekitplugins.pod import Pod
from src.dataframe_utils import random_dataframe, write_embeddings_s3
from src.protein_utils import calculate_esm_embeddings
import polars as pl
from kubernetes.client.models import V1PodSpec, V1Container, V1ResourceRequirements
import argparse

cpu_container_image = 'ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/flyte_runtime_containers:cpu-prod-2024-12-06'
gpu_container_image = 'ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/flyte_runtime_containers:gpu-prod-2024-12-06'

cpu_pod_template = PodTemplate(
    annotations={"karpenter.sh/do-not-disrupt": "true"},
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                image=cpu_container_image,
                resources=V1ResourceRequirements(
                    limits={'cpu': "1", 'mem': "1Gi", 'ephemeral_storage': '10Gi'}),
            ),
        ],
        node_selector={'karpenter.sh/nodepool': 'cpu-nodepool'}
    ),
)

gpu_pod_template = PodTemplate(
    annotations={"karpenter.sh/do-not-disrupt": "true"},
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",
                image=gpu_container_image,
                resources=V1ResourceRequirements(
                    limits={'cpu': "1", 'mem': "1Gi", 'ephemeral_storage': '10Gi'}),
            ),
        ],
        node_selector={'karpenter.sh/nodepool': 'cpu-nodepool'}
    ),
)

@task(
    pod_template=cpu_pod_template,
    container_image=cpu_container_image,
    requests=Resources(cpu="1", mem="1Gi", ephemeral_storage='10Gi'),
    limits=Resources(cpu="2", mem="2Gi", ephemeral_storage='10Gi'),
)
def make_dataframe(num_rows: int) -> pl.DataFrame:
    return random_dataframe(num_rows)

@task(
    pod_template=gpu_pod_template,
    container_image=gpu_container_image,
    requests=Resources(cpu="4", mem="8Gi", gpu=1, ephemeral_storage='10Gi'),
    limits=Resources(cpu="8", mem="16Gi", gpu=1, ephemeral_storage='10Gi'),
)
def esm_embed_sequences(sequence_dataframe: pl.DataFrame) -> pl.DataFrame:
    return calculate_esm_embeddings(sequence_dataframe)

@task(
    pod_template=cpu_pod_template,
    container_image=cpu_container_image,
    requests=Resources(cpu="4", mem="8Gi", ephemeral_storage='10Gi'),
    limits=Resources(cpu="8", mem="16Gi", ephemeral_storage='10Gi')
)
def write_embeddings_df_to_s3(embeddings_df: pl.DataFrame, s3_location: str) -> None:
    write_embeddings_s3(embeddings_df, s3_location)

@workflow
def proof_of_concept_wf(num_rows: int, s3_output_loc: str) -> None:
    seq_df = make_dataframe(num_rows)
    esm_df = esm_embed_sequences(seq_df)
    write_results_to_s3 = write_embeddings_df_to_s3(esm_df, s3_output_loc)
Main workflow
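For context, a workflow like this is presumably launched with something along these lines; the file path and argument values are illustrative, not taken from the thread:

# Assumed launch command (registers and runs the workflow remotely)
pyflyte run --remote workflows/poc.py proof_of_concept_wf --num_rows 100 --s3_output_loc s3://my-bucket/embeddings/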
freezing-tailor-85994
12/09/2024, 6:51 PM
freezing-tailor-85994
12/09/2024, 7:38 PM
(base) bfrench@LM-BFRENCH:~/Documents/Code/flyte-poc$ kubectl logs -n pipeline-poc-development poc-d612956150614cf0b9f-n0-0
/miniconda/bin/pyflyte-fast-execute: line 3: import: command not found
/miniconda/bin/pyflyte-fast-execute: line 4: import: command not found
/miniconda/bin/pyflyte-fast-execute: line 5: from: command not found
/miniconda/bin/pyflyte-fast-execute: pyflyte-fast-execute: line 7: syntax error near unexpected token `('
/miniconda/bin/pyflyte-fast-execute: pyflyte-fast-execute: line 7: ` sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])'
Logs from the pod that executes the code
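Those "import: command not found" / "from: command not found" lines are what bash prints when it is handed a Python console script to interpret line by line. Assuming the script path shown in the log, the same output can be reproduced with something like:

# bash reading the Python entry-point script as a shell script produces these messages
bash /miniconda/bin/pyflyte-fast-execute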
thankful-minister-83577
default builder for better caching. image spec should be enough for most of the basic dockerfile setups.
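For context on that suggestion, a minimal ImageSpec sketch might look like the following; the registry, image name, and trimmed-down package list are illustrative assumptions rather than the poster's actual configuration:

from flytekit import ImageSpec, Resources, task
import polars as pl
from src.protein_utils import calculate_esm_embeddings

# Hypothetical ImageSpec standing in for the hand-written GPU Dockerfile;
# registry, name, and the shortened package list are assumptions.
gpu_image = ImageSpec(
    name="flyte_runtime_containers",
    base_image="nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04",
    registry="ACCOUNT.dkr.ecr.us-east-1.amazonaws.com",
    packages=["flytekit", "flytekitplugins-huggingface", "transformers", "polars", "torch==2.5.1"],
)

@task(
    container_image=gpu_image,  # flytekit builds and pushes this image when the workflow is registered/run
    requests=Resources(cpu="4", mem="8Gi", gpu="1", ephemeral_storage="10Gi"),
)
def esm_embed_sequences(sequence_dataframe: pl.DataFrame) -> pl.DataFrame:
    return calculate_esm_embeddings(sequence_dataframe)

With ImageSpec, flytekit builds and pushes the task image itself, replacing the hand-maintained Dockerfile.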
freezing-tailor-85994
12/10/2024, 2:29 PM