# flyte-support
a
Hi all, nice to e-meet you! I'm running into a Flyte issue we haven't seen before. I have a top-level workflow that contains sub-workflows. One of them is failing with the error message "failed to create workflow in propeller etcdserver: request is too large." I looked at the flyte-propeller logs but could not find any specific errors. Below are the inputs to this workflow in case that helps. Does anyone have any insight into what this error might indicate?
```json
{
  "chunk_wait_seconds": 60,
  "start_datetime": "1/1/2013 12:00:00 AM UTC",
  "qhat_cc": {
    "union": "<gs://planet-forests-jira/FO-955/conformalizese_pv-forests-diligence-canopy-cover-v1.3.0-1x1.csv>"
  },
  "se_decimals_ch": {
    "union": 1
  },
  "overwrite": false,
  "spline_df_cc": {
    "union": 3
  },
  "lambda_ridge_cc": {
    "union": 0.4560787425514926
  },
  "qhat_ch": {
    "union": "<gs://planet-forests-jira/FO-955/conformalizese_pv-forests-diligence-canopy-height-v1.3.0-1x1.csv>"
  },
  "feature_scaler_path_ch": {
    "union": "<gs://pv-forests-diligence-training/libraries/diligence-v3-canopy_height.train.features.robust.scaler.pck>"
  },
  "cv_threshold_cc": {
    "union": 0.012505642062132475
  },
  "ramp_up_factor": 10,
  "gedify_model_path": {
    "union": "<gs://pv-forests-diligence-training/models/forest-observatory/model-registry/agb:v32/model.joblib>"
  },
  "se_decimals_cc": {
    "union": 0
  },
  "denoise_asset_keys": {
    "union": [
      "denoised",
      "denoised_se",
      "change_category"
    ]
  },
  "update_timeseries": true,
  "steps_to_skip": "(empty)",
  "model_config_paths_ch": {
    "union": [
      "<gs://pv-forests-diligence-training/models/diligence-v3-canopy_height-04b/config.yml>"
    ]
  },
  "cv_threshold_ch": {
    "union": 0.32753039812553936
  },
  "spline_df_ch": {
    "union": 3
  },
  "aic_threshold_ch": {
    "union": 9.270085537273262
  },
  "aic_threshold_cc": {
    "union": 7.70976136557189
  },
  "published_asset_keys": {
    "union": [
      [
        "data",
        "uncertainty",
        "change_category",
        "dayofyear",
        "score"
      ],
      [
        "data",
        "uncertainty",
        "change_category",
        "dayofyear",
        "score"
      ],
      [
        "data",
        "uncertainty",
        "dayofyear",
        "score"
      ]
    ]
  },
  "feature_scaler_path_cc": {
    "union": "<gs://pv-forests-diligence-training/libraries/diligence-v3-cover.train.features.robust.scaler.pck>"
  },
  "aoi": {
    "tag": "WKB (binary data not shown)"
  },
  "response_scaler_path_ch": {
    "union": "<gs://pv-forests-diligence-training/libraries/diligence-v3-canopy_height.train.response.robust.scaler.pck>"
  },
  "version": "v1.3.0.test",
  "denoise_prediction_version": {
    "union": "v1.1.0"
  },
  "response_scaler_path_cc": {
    "union": "<gs://pv-forests-diligence-training/libraries/diligence-v3-cover.train.response.robust.scaler.pck>"
  },
  "lambda_ridge_ch": {
    "union": 0.6031510586243339
  },
  "priority": 0,
  "model_config_paths_cc": {
    "union": [
      "<gs://pv-forests-diligence-training/models/diligence-v3-cover-04/config.yml>"
    ]
  },
  "end_datetime": "1/1/2025 12:00:00 AM UTC"
}
```
c
Flyte workflow resources are stored in etcd, and etcd has size limits on the workflow resource definition. It's more than just the inputs; it's also the DAG definition, the outputs, the current status, etc.
You might be able to reduce the workflow size by offloading static elements of the workflow spec to blob storage: https://www.union.ai/docs/v1/flyte/deployment/flyte-configuration/performance/#offloading-static-workflow-information-from-crd You may also just need to rework the structure of the workflow, or you can reconfigure etcd's size limits.
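For reference, a rough sketch of what that offloading switch looks like, assuming a flyte-core style deployment where the FlyteAdmin config is set through Helm values (the exact nesting depends on your setup, so treat it as illustrative):
```yaml
# Illustrative only -- the exact nesting depends on your chart/deployment.
flyteadmin:
  # Store the static workflow closure in blob storage instead of embedding
  # it all in the FlyteWorkflow CRD that propeller writes to etcd.
  useOffloadedWorkflowClosure: true
```
Raising etcd's request-size limit (its --max-request-bytes server flag) is also an option, but that applies to the whole cluster, so offloading or restructuring the workflow is usually the lighter-touch change.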
If you share the entire workflow definition we might be able to help guide you in the right direction.
f
It might just be a very large map task?
Our V2 architecture will get rid of this problem.
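As a concrete illustration of that (a hypothetical sketch, not the workflow from this thread; task and workflow names are made up), a map task fanned out over a very large collection can push the FlyteWorkflow resource toward etcd's request-size limit:
```python
# Illustrative sketch only: a map task with a very large fan-out.
from flytekit import map_task, task, workflow


@task
def score_chunk(chunk_id: int) -> float:
    # Placeholder work for one chunk.
    return float(chunk_id)


@workflow
def big_fanout_workflow() -> list[float]:
    # Mapping over tens of thousands of items means a large literal input
    # plus per-item execution state tracked on the workflow resource.
    return map_task(score_chunk)(chunk_id=list(range(50_000)))
```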
a
@clean-glass-36808 and @freezing-airport-6809, sorry for the late reply as I was out on vacation, and thanks a lot for your input! We're not ready for V2 migration yet, so we're looking into Jason's suggestions now.
c
Hi @clean-glass-36808, I work with Dieu My on this. We already have our cluster configured with `useOffloadedWorkflowClosure=true`.
The workflow we run calls a reference workflow that is managed in a different repo; this is the sub-workflow that fails. When we call this workflow from within its original repo, it works fine. The reference workflow signature is as follows:
```python
# Imports reconstructed for readability; BaseGeometry is assumed to come
# from shapely, and REFERENCE_VERSION is a constant defined elsewhere in
# the repo.
import datetime as dt

from flytekit import reference_launch_plan
from shapely.geometry.base import BaseGeometry


@reference_launch_plan(
    project="forests",
    domain="live",
    name="mycobiome.data_products.diligence.flyte.base_workflow.diligence_workflow",
    version=REFERENCE_VERSION,
)
def mycobiome_diligence_workflow(
    aoi: BaseGeometry,
    start_datetime: dt.datetime,
    end_datetime: dt.datetime,
    overwrite: bool,
    update_timeseries: bool,
    chunk_wait_seconds: int,
    ramp_up_factor: int,
    steps_to_skip: list[str] | None,
    version: str,
    model_config_paths_cc: list[str] | str | None,
    feature_scaler_path_cc: str | None,
    response_scaler_path_cc: str | None,
    qhat_cc: str | None,
    model_config_paths_ch: list[str] | str | None,
    feature_scaler_path_ch: str | None,
    response_scaler_path_ch: str | None,
    qhat_ch: str | None,
    cv_threshold_cc: float | None,
    cv_threshold_ch: float | None,
    lambda_ridge_cc: float | None,
    lambda_ridge_ch: float | None,
    aic_threshold_cc: float | None,
    aic_threshold_ch: float | None,
    se_decimals_cc: int | None,
    se_decimals_ch: int | None,
    spline_df_cc: int | None,
    spline_df_ch: int | None,
    denoise_asset_keys: list[str] | None,
    denoise_prediction_version: str | None,
    gedify_model_path: str | None,
    published_asset_keys: list[list[str]] | None,
    priority: int = 0,
) -> None: ...
```
a
Hi @clean-glass-36808, just circling back to this thread in case we can get some further help on this issue. Thanks!