# flyte-support
a
Hi all, nice to e-meet you! I'm running into a Flyte issue we haven't seen before. I have a top-level workflow that contains sub-workflows. One of them is failing with the error message "failed to create workflow in propeller etcdserver: request is too large." I looked at the flyte-propeller logs but could not find any specific errors. Below are the inputs to this workflow in case that helps. Does anyone have any insight into what this error might indicate?
```json
{
  "chunk_wait_seconds": 60,
  "start_datetime": "1/1/2013 12:00:00 AM UTC",
  "qhat_cc": {
    "union": "<gs://planet-forests-jira/FO-955/conformalizese_pv-forests-diligence-canopy-cover-v1.3.0-1x1.csv>"
  },
  "se_decimals_ch": {
    "union": 1
  },
  "overwrite": false,
  "spline_df_cc": {
    "union": 3
  },
  "lambda_ridge_cc": {
    "union": 0.4560787425514926
  },
  "qhat_ch": {
    "union": "<gs://planet-forests-jira/FO-955/conformalizese_pv-forests-diligence-canopy-height-v1.3.0-1x1.csv>"
  },
  "feature_scaler_path_ch": {
    "union": "<gs://pv-forests-diligence-training/libraries/diligence-v3-canopy_height.train.features.robust.scaler.pck>"
  },
  "cv_threshold_cc": {
    "union": 0.012505642062132475
  },
  "ramp_up_factor": 10,
  "gedify_model_path": {
    "union": "<gs://pv-forests-diligence-training/models/forest-observatory/model-registry/agb:v32/model.joblib>"
  },
  "se_decimals_cc": {
    "union": 0
  },
  "denoise_asset_keys": {
    "union": [
      "denoised",
      "denoised_se",
      "change_category"
    ]
  },
  "update_timeseries": true,
  "steps_to_skip": "(empty)",
  "model_config_paths_ch": {
    "union": [
      "<gs://pv-forests-diligence-training/models/diligence-v3-canopy_height-04b/config.yml>"
    ]
  },
  "cv_threshold_ch": {
    "union": 0.32753039812553936
  },
  "spline_df_ch": {
    "union": 3
  },
  "aic_threshold_ch": {
    "union": 9.270085537273262
  },
  "aic_threshold_cc": {
    "union": 7.70976136557189
  },
  "published_asset_keys": {
    "union": [
      [
        "data",
        "uncertainty",
        "change_category",
        "dayofyear",
        "score"
      ],
      [
        "data",
        "uncertainty",
        "change_category",
        "dayofyear",
        "score"
      ],
      [
        "data",
        "uncertainty",
        "dayofyear",
        "score"
      ]
    ]
  },
  "feature_scaler_path_cc": {
    "union": "<gs://pv-forests-diligence-training/libraries/diligence-v3-cover.train.features.robust.scaler.pck>"
  },
  "aoi": {
    "tag": "WKB (binary data not shown)"
  },
  "response_scaler_path_ch": {
    "union": "<gs://pv-forests-diligence-training/libraries/diligence-v3-canopy_height.train.response.robust.scaler.pck>"
  },
  "version": "v1.3.0.test",
  "denoise_prediction_version": {
    "union": "v1.1.0"
  },
  "response_scaler_path_cc": {
    "union": "<gs://pv-forests-diligence-training/libraries/diligence-v3-cover.train.response.robust.scaler.pck>"
  },
  "lambda_ridge_ch": {
    "union": 0.6031510586243339
  },
  "priority": 0,
  "model_config_paths_cc": {
    "union": [
      "<gs://pv-forests-diligence-training/models/diligence-v3-cover-04/config.yml>"
    ]
  },
  "end_datetime": "1/1/2025 12:00:00 AM UTC"
}
```
c
Flyte workflow resources are stored in etcd, and etcd has size limits on the workflow resource definition. It's more than just the inputs; it's also the DAG definition, the outputs, the current status, etc.
You might be able to reduce the workflow size by offloading static elements of the workflow spec to blob storage: https://www.union.ai/docs/v1/flyte/deployment/flyte-configuration/performance/#offloading-static-workflow-information-from-crd You may also just need to rework the structure of the workflow, or you can reconfigure etcd's size limits.
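For reference, a rough sketch of what that offloading switch looks like, assuming a flyte-core style deployment where the FlyteAdmin config is set through Helm values (the exact nesting depends on your setup, so treat it as illustrative):
```yaml
# Illustrative only -- the exact nesting depends on your chart/deployment.
flyteadmin:
  # Store the static workflow closure in blob storage instead of embedding
  # it all in the FlyteWorkflow CRD that propeller writes to etcd.
  useOffloadedWorkflowClosure: true
```
Raising etcd's request-size limit (its --max-request-bytes server flag) is also an option, but that applies to the whole cluster, so offloading or restructuring the workflow is usually the lighter-touch change.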
If you share the entire workflow definition we might be able to help guide you in the right direction.
f
It might just be a very large map task?
Our V2 architecture will get rid of this problem.
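As a concrete illustration of that (a hypothetical sketch, not the workflow from this thread; task and workflow names are made up), a map task fanned out over a very large collection can push the FlyteWorkflow resource toward etcd's request-size limit:
```python
# Illustrative sketch only: a map task with a very large fan-out.
from flytekit import map_task, task, workflow


@task
def score_chunk(chunk_id: int) -> float:
    # Placeholder work for one chunk.
    return float(chunk_id)


@workflow
def big_fanout_workflow() -> list[float]:
    # Mapping over tens of thousands of items means a large literal input
    # plus per-item execution state tracked on the workflow resource.
    return map_task(score_chunk)(chunk_id=list(range(50_000)))
```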
a
@clean-glass-36808 and @freezing-airport-6809, sorry for the late reply as I was out on vacation, and thanks a lot for your input! We're not ready for V2 migration yet, so we're looking into Jason's suggestions now.
c
Hi @clean-glass-36808, I work with Dieu My on this. We already have our cluster configured with `useOffloadedWorkflowClosure=true`.
The workflow we run calls a reference workflow that is managed in a different repo; this is the sub-workflow that fails. When we call this workflow from within its original repo, it works fine. The reference workflow signature is as follows:
```python
# Imports reconstructed for readability; BaseGeometry is assumed to come
# from shapely, and REFERENCE_VERSION is a constant defined elsewhere in
# the repo.
import datetime as dt

from flytekit import reference_launch_plan
from shapely.geometry.base import BaseGeometry


@reference_launch_plan(
    project="forests",
    domain="live",
    name="mycobiome.data_products.diligence.flyte.base_workflow.diligence_workflow",
    version=REFERENCE_VERSION,
)
def mycobiome_diligence_workflow(
    aoi: BaseGeometry,
    start_datetime: dt.datetime,
    end_datetime: dt.datetime,
    overwrite: bool,
    update_timeseries: bool,
    chunk_wait_seconds: int,
    ramp_up_factor: int,
    steps_to_skip: list[str] | None,
    version: str,
    model_config_paths_cc: list[str] | str | None,
    feature_scaler_path_cc: str | None,
    response_scaler_path_cc: str | None,
    qhat_cc: str | None,
    model_config_paths_ch: list[str] | str | None,
    feature_scaler_path_ch: str | None,
    response_scaler_path_ch: str | None,
    qhat_ch: str | None,
    cv_threshold_cc: float | None,
    cv_threshold_ch: float | None,
    lambda_ridge_cc: float | None,
    lambda_ridge_ch: float | None,
    aic_threshold_cc: float | None,
    aic_threshold_ch: float | None,
    se_decimals_cc: int | None,
    se_decimals_ch: int | None,
    spline_df_cc: int | None,
    spline_df_ch: int | None,
    denoise_asset_keys: list[str] | None,
    denoise_prediction_version: str | None,
    gedify_model_path: str | None,
    published_asset_keys: list[list[str]] | None,
    priority: int = 0,
) -> None: ...
```
a
Hi @clean-glass-36808, just circling back to this thread in case we can get some further help on this issue. Thanks!