# announcements
a
hi flyte team, we recently upgraded the console to the latest and now it doesn't show dynamic subtasks. Is there a switch/config setting for that?
h
Cc @Jason Porter @Nastya Rusina
a
Copy code
flyteadmin_version     = "v1.1.21"
flyteconsole_version   = "v1.1.0"
flytecopilot_version   = "v0.0.26"
flytepropeller_version = "v1.1.12"
n
Does it happen for all three views: Node executions/Graph/Timeline? Can you please provide a sample/screenshot of what you see and where the info is missing?
a
yes, all views
checkerboard_dynamic_tasks
is there one that's supposed to have subtasks?
j
Okay thanks @Alex Pryiomka - we'll take a look
a
the version info i provided is wrong
is there a way to see versions in the console?
j
Yes. There is a little "i" in the top right corner; clicking that will open version information 👍
a
UI Version 1.1.0 Admin Version 1.1.21
so it was correct 🙂
j
Okay great - we're going to track this bug fix here 👍 https://github.com/flyteorg/flyteconsole/issues/512
thx 1
a
@Jason Porter - in case this is helpful - the console shows subtasks while a workflow is running, but they disappear for completed workflows. It looks particularly odd when a subtask fails: Flyte marks the workflow as failed, yet the attempts are shown as successful and there are no errors in the top-level logs
n
@Alex Pozimenko Can you please check if old executions (prior to the update) for the same workflow show sub-workflow items?
a
@Nastya Rusina - yes, old executions look fine.
🙇‍♀️ 1
n
Thanks for confirming. We will dig deeper, but I suspect that something changed in the structure of the returned DAG. cc: @Haytham Abuelfutuh
e
Hi @Alex Pozimenko, I have tried to reproduce the issue on my side but can't. Are you free sometime for a short call? Just want to make sure the API payload is correct.
a
hey @eugene jahn, how about 1pm today?
e
how about 2pm PST?
e
Can we do 2pm PST?
j
2pm works for me 👍
a
sgtm. alex.pozimenko@woven-planet.global
e
sent invitation! see you later
a
sorry, in a mtg which will likely run a little over, so will join 5 min later
👍 1
Copy code
{
  "node_executions": [
    {
      "id": {
        "node_id": "start-node",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "f67f46d3206de43699b7"
        }
      },
      "closure": {
        "output_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-f67f46d3206de43699b7/start-node/data/0/outputs.pb>",
        "phase": "SUCCEEDED",
        "created_at": "2022-06-21T20:16:34.290172734Z",
        "updated_at": "2022-06-21T20:16:34.290172734Z"
      },
      "metadata": {
        "spec_node_id": "start-node"
      }
    },
    {
      "id": {
        "node_id": "n0",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "f67f46d3206de43699b7"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-f67f46d3206de43699b7/n0/data/inputs.pb>",
      "closure": {
        "error": {
          "code": "RetriesExhausted|USER:Unknown",
          "message": "[2/2] currentAttempt done. Last Error: USER::Traceback (most recent call last):\n\n      File \"/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.8/site-packages/flytekit/exceptions/scopes.py\", line 203, in user_entry_point\n        return wrapped(*args, **kwargs)\n      File \"/root/flyte/flyte/tasks/fs1_training_data.py\", line 88, in get_annotation_info_task\n        call_scene_reconstruction_binary(\n      File \"/root/flyte/flyte/commands/scene_command.py\", line 51, in call_scene_reconstruction_binary\n        subprocess_handler.run(binary=command, args=params, log_stdout=True)\n      File \"/root/cli/cli/subprocess_handler.py\", line 53, in run\n        raise subprocess.CalledProcessError(returncode=exit_code, cmd=command, output=stdout, stderr=stderr)\n\nMessage:\n\n    Command '['annotation', 'query', '--verbose', '--data-source', 'FS1', '--label-source', 'SCALE', '--output-file', '/tmp/flyte/20220621_202627/sandbox/local_flytekit/39e65c520948189a3f2663c3f94842e3/annotations.csv', '--scale-project-name', 'panda_lfd_lidar', '--task-id', '60c72d8333efc20018b6ea0d']' returned non-zero exit status 1.\n\nUser error.",
          "kind": "USER"
        },
        "phase": "FAILED",
        "started_at": "2022-06-21T20:16:34.460611038Z",
        "duration": "804.056366105s",
        "created_at": "2022-06-21T20:16:34.354000459Z",
        "updated_at": "2022-06-21T20:29:58.516977105Z"
      },
      "metadata": {
        "spec_node_id": "n0",
        "is_dynamic": true
      }
    }
  ]
}
Copy code
{
  "node_executions": [
    {
      "id": {
        "node_id": "start-node",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "xcipb7ytwp"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-xcipb7ytwp/start-node/data/inputs.pb>",
      "closure": {
        "phase": "SUCCEEDED",
        "created_at": "2021-09-23T13:27:27.874059940Z",
        "updated_at": "2021-09-23T13:27:27.874059940Z"
      },
      "metadata": {
        "spec_node_id": "start-node"
      }
    },
    {
      "id": {
        "node_id": "n0",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "xcipb7ytwp"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-xcipb7ytwp/n0/data/inputs.pb>",
      "closure": {
        "output_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-xcipb7ytwp/n0/data/0/outputs.pb>",
        "phase": "SUCCEEDED",
        "started_at": "2021-09-23T13:27:28.357826621Z",
        "duration": "11798.029477360s",
        "created_at": "2021-09-23T13:27:28.089924503Z",
        "updated_at": "2021-09-23T16:44:06.387304360Z"
      },
      "metadata": {
        "is_parent_node": true,
        "spec_node_id": "n0"
      }
    },
    {
      "id": {
        "node_id": "n1",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "xcipb7ytwp"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-xcipb7ytwp/n1/data/inputs.pb>",
      "closure": {
        "output_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-xcipb7ytwp/n1/data/0/outputs.pb>",
        "phase": "SUCCEEDED",
        "started_at": "2021-09-23T16:44:52.527074281Z",
        "duration": "548.360438964s",
        "created_at": "2021-09-23T16:44:52.317653506Z",
        "updated_at": "2021-09-23T16:54:00.887512964Z"
      },
      "metadata": {
        "spec_node_id": "n1"
      }
    },
    {
      "id": {
        "node_id": "end-node",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "xcipb7ytwp"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-xcipb7ytwp/end-node/data/inputs.pb>",
      "closure": {
        "phase": "SUCCEEDED",
        "created_at": "2021-09-23T16:54:01.094210244Z",
        "updated_at": "2021-09-23T16:54:01.446059558Z"
      },
      "metadata": {
        "spec_node_id": "end-node"
      }
    }
  ]
}
example of failed old execution that shows expand option:
Copy code
{
  "node_executions": [
    {
      "id": {
        "node_id": "start-node",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "f860a446514bb4e07be0"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-f860a446514bb4e07be0/start-node/data/inputs.pb>",
      "closure": {
        "phase": "SUCCEEDED",
        "created_at": "2021-09-22T15:58:52.269739635Z",
        "updated_at": "2021-09-22T15:58:52.269739635Z"
      },
      "metadata": {
        "spec_node_id": "start-node"
      }
    },
    {
      "id": {
        "node_id": "n0",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "f860a446514bb4e07be0"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-f860a446514bb4e07be0/n0/data/inputs.pb>",
      "closure": {
        "error": {
          "code": "RetriesExhausted|USER:Unknown",
          "message": "[2/2] currentAttempt done. Last Error: USER::Traceback (most recent call last):\n\n      File \"/root/.local/share/virtualenvs/root-BuDEOXnJ/lib/python3.8/site-packages/flytekit/common/exceptions/scopes.py\", line 203, in user_entry_point\n        return wrapped(*args, **kwargs)\n      File \"/root/flyte/flyte/tasks/fs1_training_data.py\", line 324, in generate_l5mldatastore_dataset_task\n        call_scene_reconstruction_binary(\n      File \"/root/flyte/flyte/commands/scene_command.py\", line 51, in call_scene_reconstruction_binary\n        subprocess_handler.run(binary=command, args=params, log_stdout=True)\n      File \"/root/cli/cli/subprocess_handler.py\", line 53, in run\n        raise subprocess.CalledProcessError(returncode=exit_code, cmd=command, output=stdout, stderr=stderr)\n\nMessage:\n\n    Command '['annotation', 'write-training-data-chunk-l5mldatastore', '/tmp/flytezs70k80t/local_flytekit/92f659b1ba7cba47388b85d2c3ca177f/', '--verbose', '--chunk-id', '217', '--dataset-name', 'fs1_stereo_dataset', '--dataset-version', '0.0.3071-main.32268f0_0.0.1', '--end-ts', '1621979365', '--filtered-tracks-pb', '/tmp/flytezs70k80t/local_flytekit/8f281133547f3cb71452f2ef735477d1/filtered_tracks.pb', '--mission-id', '1863992517161235894_9314680418666361737', '--obstacles-frame', 'camera', '--output-file', '/tmp/flytezs70k80t/20210922_171104/local_flytekit/740367bf4e8c77e6ae90c2d4b3f0aa40/l5mldatastore_chunk_metadata_217.json', '--partition-name', 'train', '--start-ts', '1621979337']' returned non-zero exit status 1.\n\nUser error.",
          "kind": "USER"
        },
        "phase": "FAILED",
        "started_at": "2021-09-22T15:58:52.479254644Z",
        "duration": "4790.533657594s",
        "created_at": "2021-09-22T15:58:52.358488635Z",
        "updated_at": "2021-09-22T17:18:43.012912594Z"
      },
      "metadata": {
        "is_parent_node": true,
        "spec_node_id": "n0"
      }
    }
  ]
}
(^^^these are private links only available to wp employees)
h
Hey @Alex Pozimenko, sorry you are facing problems with the upgrade. If you don’t mind double confirming this, both the older and more recent executions ran the same version of the workflow?
Mind also checking what version of FlytePropeller you are running?
@katrina Do you think this is related to this change? https://github.com/flyteorg/flyteadmin/pull/382 I see is_parent_node is no longer set to true… but it should be, right?
k
we should only ever set is_parent to true and never flip it from true back to false (which matches what you're seeing: running workflows show the subtasks but completed ones don't)
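to illustrate what I mean, here's a hypothetical Python sketch of that invariant (the real flyteadmin logic is Go and more involved; the field names come from the payloads above, the helper itself is made up):
Copy code
def merge_node_metadata(existing: dict, update: dict) -> dict:
    # Hypothetical sketch: the parent/dynamic flags should be "sticky".
    # Once an event reports is_parent_node=True for a node execution,
    # a later event must never clear it back to False.
    merged = {**existing, **update}
    for flag in ("is_parent_node", "is_dynamic"):
        merged[flag] = bool(existing.get(flag)) or bool(update.get(flag))
    return merged


# e.g. a completion event that omits the flag must not erase it
running = {"is_parent_node": True, "spec_node_id": "n0"}
completed = {"spec_node_id": "n0", "is_dynamic": True}
assert merge_node_metadata(running, completed)["is_parent_node"] is True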
hey @Alex Pozimenko if you don't mind, would it be possible to get the same node execution json you shared above when it is running and the console is showing subtasks?
e
@katrina this is an example that executed successfully before the update https://jsonblob.com/988938897279172608
🙏 1
k
huh, so is_parent_node is indeed set
how long do the subtasks run for? are they particularly short-lived?
a
if you don’t mind double confirming this, both the older and more recent executions ran the same version of the workflow?
@Haytham Abuelfutuh the versions are different (they're 2 months apart)
@Haytham Abuelfutuh flytepropeller-v1.1.12
h
thank you
a
i have the workflow running now, is this the json response you need?
Copy code
{
  "node_executions": [
    {
      "id": {
        "node_id": "start-node",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "a8djqk9pzmdgfdjf75lx"
        }
      },
      "closure": {
        "output_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-a8djqk9pzmdgfdjf75lx/start-node/data/0/outputs.pb>",
        "phase": "SUCCEEDED",
        "created_at": "2022-06-21T23:38:52.525455365Z",
        "updated_at": "2022-06-21T23:38:52.525455365Z"
      },
      "metadata": {
        "spec_node_id": "start-node"
      }
    },
    {
      "id": {
        "node_id": "n0",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "a8djqk9pzmdgfdjf75lx"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-a8djqk9pzmdgfdjf75lx/n0/data/inputs.pb>",
      "closure": {
        "phase": "RUNNING",
        "started_at": "2022-06-21T23:38:52.669382566Z",
        "created_at": "2022-06-21T23:38:52.616230119Z",
        "updated_at": "2022-06-21T23:38:52.669382566Z"
      },
      "metadata": {
        "is_parent_node": true,
        "spec_node_id": "n0"
      }
    }
  ]
}
or this (same execution as above):
Copy code
{
  "node_executions": [
    {
      "id": {
        "node_id": "n0-0-start-node",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "a8djqk9pzmdgfdjf75lx"
        }
      },
      "closure": {
        "output_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-a8djqk9pzmdgfdjf75lx/n0/data/0/start-node/0/outputs.pb>",
        "phase": "SUCCEEDED",
        "created_at": "2022-06-21T23:41:52.895604301Z",
        "updated_at": "2022-06-21T23:41:52.895604301Z"
      },
      "metadata": {
        "retry_group": "0",
        "spec_node_id": "start-node"
      }
    },
    {
      "id": {
        "node_id": "n0-0-dn0",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "a8djqk9pzmdgfdjf75lx"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-a8djqk9pzmdgfdjf75lx/n0/data/0/dn0/inputs.pb>",
      "closure": {
        "phase": "RUNNING",
        "started_at": "2022-06-21T23:41:55.385688619Z",
        "created_at": "2022-06-21T23:41:53.138938047Z",
        "updated_at": "2022-06-21T23:41:55.385688619Z",
        "workflow_node_metadata": {
          "executionId": {
            "project": "avfleetscenes",
            "domain": "dev",
            "name": "foyc3j4i"
          }
        }
      },
      "metadata": {
        "retry_group": "0",
        "spec_node_id": "dn0"
      }
    },
    {
      "id": {
        "node_id": "n0-0-dn1",
        "execution_id": {
          "project": "avfleetscenes",
          "domain": "dev",
          "name": "a8djqk9pzmdgfdjf75lx"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avfleetscenes-dev-a8djqk9pzmdgfdjf75lx/n0/data/0/dn1/inputs.pb>",
      "closure": {
        "phase": "RUNNING",
        "started_at": "2022-06-21T23:41:55.677555860Z",
        "created_at": "2022-06-21T23:41:53.193659416Z",
        "updated_at": "2022-06-21T23:41:55.677555860Z",
        "workflow_node_metadata": {
          "executionId": {
            "project": "avfleetscenes",
            "domain": "dev",
            "name": "f4gdb2lq"
          }
        }
      },
      "metadata": {
        "retry_group": "0",
        "spec_node_id": "dn1"
      }
    }
  ]
}
in case this helps, the console continues to show the expand option after completion if the execution was open while it was still running. But if I refresh the page, the expand option disappears
k
this definitely sounds like a back-end issue overwriting the is parent node bit. thanks @Alex Pozimenko for all the helpful reporting, i will take a look at fixing this
👍 1
hey @Alex Pozimenko just to double check, what flyteadmin version were you on before you upgraded?
a
v0.6.112
k
hey @Jason Porter for my understanding, what indicates to the UI that it should show subtasks?
@Alex Pozimenko could you share the workflow definition (with anything sensitive scrubbed out?)
j
That's kind of a complex question 😅 but yes, generally speaking (in terms of checking existence and getting phase) we key off of is_parent_node
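roughly something like this, as a hypothetical Python sketch (not the actual console code, which is TypeScript) of the metadata.is_parent_node check from the payloads above:
Copy code
def is_expandable(node_execution: dict) -> bool:
    # Hypothetical sketch: treat a node as having subtasks (and show the
    # expand control) when its metadata carries is_parent_node.
    # The real frontend logic is more involved.
    return bool(node_execution.get("metadata", {}).get("is_parent_node"))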
k
thanks Jason, @Eugene Jahn confirmed for me in DM 😄
also @Alex Pozimenko sorry one more q, just to double check the flytepropeller deployment is still using v1.1.12?
✔️ 1
a
@katrina - sanitized wf definition. Hopefully i didn't remove anything material:
Copy code
import pandas as pd

from flytekit import task, dynamic, workflow, Resources
from flytekit.core.node_creation import create_node

from typing import List, Tuple, NamedTuple

SceneLevelProcessingResults = NamedTuple("OP2",
                                         num_scenes_published=int)


@task(requests=Resources(mem='4G'), retries=6)
def simulation_metadata_collect(run_id: str) -> pd.DataFrame:
    """Collect all task metadata for downstream workers"""
    # experiment_task_metadata_df = ...
    # return experiment_task_metadata_df


@dynamic
def checkerboard_dynamic_tasks(run_id: str,
                               number_shards: int) -> Tuple[List[int], List[int]]:
    num_issues_from_shards = []
    num_scenes_from_shards = []

    for i in range(number_shards):
        scene_level_processing_results = checkerboard_scene_level_processing(
            run_id=run_id,
            shard=i,
            number_shards=number_shards)
        num_issues_from_shards.append(scene_level_processing_results.num_scenes_published)
        num_scenes_from_shards.append(scene_level_processing_results.num_scenes_published)

    return num_issues_from_shards, num_scenes_from_shards


@task(requests=Resources(mem='10G'), retries=5)
def checkerboard_scene_level_processing(run_id: str,
                                        shard: int = 0,
                                        number_shards: int = 1,
                                        ) -> SceneLevelProcessingResults:

    # InitPipeline and CheckerboardScene are internal helpers omitted from this
    # sanitized snippet; InitPipeline builds the pipeline object used to fetch scenes.
    scene_pipeline = InitPipeline(run_id=run_id,
                                  number_shards=number_shards,
                                  shard=shard)
    scenes: List[CheckerboardScene] = scene_pipeline.initialize_scenes()
    # ... 

    return SceneLevelProcessingResults(num_scenes_published=len(scenes))


@workflow
def CheckerboardParallelBackendLaunch(run_id: str, number_shards: int = 1):
    metadata_collect_task = create_node(simulation_metadata_collect, run_id=run_id)

    # WF to calculate issues, dynamically scaled by amount of metrics to process
    dynamic_tasks = create_node(checkerboard_dynamic_tasks,
                                run_id=run_id,
                                number_shards=number_shards)

    metadata_collect_task >> dynamic_tasks
k
thanks @Alex Pozimenko and just to double check what does InitPipeline do?
a
it simply initializes the scene_pipeline object that is used to get a list of scenes
k
sigh still can't repro, even after the dynamic task succeeds i see
Copy code
"id": {
"node_id": "n1",
"execution_id": {
"project": "flytesnacks",
"domain": "development",
"name": "f629e8fae1aa143e1a14"
}
},
"input_uri": "<s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f629e8fae1aa143e1a14/n1/data/inputs.pb>",
"closure": {
"output_uri": "<s3://my-s3-bucket/metadata/propeller/flytesnacks-development-f629e8fae1aa143e1a14/n1/data/0/outputs.pb>",
"phase": "SUCCEEDED",
"started_at": "2022-06-22T20:21:44.605794900Z",
"duration": "120.055171600s",
"created_at": "2022-06-22T20:21:44.517225200Z",
"updated_at": "2022-06-22T20:23:44.660965600Z"
},
"metadata": {
"is_parent_node": true,
"spec_node_id": "n1",
"is_dynamic": true
}
},
as expected
a
did you refresh the console after workflow completed?
k
yup and i can expand the dynamic task
a
lmk if you want to debug on our end
also, the original workflow had more tasks (2 before and 2 after). I removed them as I didn't think they mattered; we have another workflow with a single dynamic task that has the same problem, but it's possible the other one is constructed differently
other tasks are plain @task's
k
yeah that should be fine, the reporting should be on a per node basis so i don't think that materially changes things
@Alex Pozimenko if you run the modified workflow you shared with me, does that also fail to expand subtasks for you?
a
i haven't tried it
@katrina, here's a bare-bones workflow that can be used to repro the issue. I ran it and confirmed that subtasks don't show after completion
Copy code
import flytekit

@flytekit.dynamic(
    requests=flytekit.Resources(mem='256Mi', cpu='1'),
)
def simple_batch_task(iterations: int, input_string: str) -> None:
    for i in range(iterations):
        identity_sub_task(input_string=input_string)


@flytekit.task(
    requests=flytekit.Resources(mem='256Mi', cpu='1')
)
def identity_sub_task(input_string: str) -> str:
    return input_string


@flytekit.workflow
def HelloWorldDynamicTaskWorkflow(input_string: str = 'Hello World',
                                    iterations: int = 5):
    simple_batch_task(iterations=iterations, input_string=input_string)
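(fwiw, a quick local smoke test of this repro is just calling the workflow directly, as in the hypothetical snippet below; that only exercises the Python logic, so reproducing the console issue still needs a registered run on the cluster)
Copy code
# Hypothetical local run: flytekit executes the dynamic task and its subtasks
# in-process when the workflow is called as a plain Python function.
if __name__ == "__main__":
    HelloWorldDynamicTaskWorkflow(input_string="Hello World", iterations=3)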
k
this is really weird, using flytesandbox I can run this locally (refreshed after success) and the subtask drop-downs still appear for me
it looks like you're on all the latest components so i wonder if this is a regression but nothing seems suspicious in recent changes
@Jason Porter would console v1.1.0 vs v1.1.1 make any difference here?
a
maybe some client side issue? I'm using Chrome Version 96.0.4664.55 (Official Build) (x86_64)
j
Hmm, nothing obvious; however, technically all of the changes between those two versions could potentially affect that view 😅. Let me have the FE team take another look into this
k
hey @Alex Pozimenko argh so I upgraded my sandbox components to match the exact versions you're using and i still can't repro 🤯
a
odd... is there a flag in the API response that enables the expand?
k
no it should be coming from that is_parent_node bit in the backend
i believe 😅
a
this is what I get:
Copy code
{
  "node_executions": [
    {
      "id": {
        "node_id": "start-node",
        "execution_id": {
          "project": "avexampleworkflows",
          "domain": "dev",
          "name": "alkhkhqw6qlxgrqq2lv9"
        }
      },
      "closure": {
        "output_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avexampleworkflows-dev-alkhkhqw6qlxgrqq2lv9/start-node/data/0/outputs.pb>",
        "phase": "SUCCEEDED",
        "created_at": "2022-06-23T20:37:45.792945109Z",
        "updated_at": "2022-06-23T20:37:45.792945109Z"
      },
      "metadata": {
        "spec_node_id": "start-node"
      }
    },
    {
      "id": {
        "node_id": "n0",
        "execution_id": {
          "project": "avexampleworkflows",
          "domain": "dev",
          "name": "alkhkhqw6qlxgrqq2lv9"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avexampleworkflows-dev-alkhkhqw6qlxgrqq2lv9/n0/data/inputs.pb>",
      "closure": {
        "phase": "SUCCEEDED",
        "started_at": "2022-06-23T20:37:45.909076702Z",
        "duration": "310.470520303s",
        "created_at": "2022-06-23T20:37:45.849547550Z",
        "updated_at": "2022-06-23T20:42:56.379596303Z"
      },
      "metadata": {
        "spec_node_id": "n0",
        "is_dynamic": true
      }
    },
    {
      "id": {
        "node_id": "end-node",
        "execution_id": {
          "project": "avexampleworkflows",
          "domain": "dev",
          "name": "alkhkhqw6qlxgrqq2lv9"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avexampleworkflows-dev-alkhkhqw6qlxgrqq2lv9/end-node/data/inputs.pb>",
      "closure": {
        "phase": "SUCCEEDED",
        "created_at": "2022-06-23T20:42:56.461800234Z",
        "updated_at": "2022-06-23T20:42:56.522386713Z"
      },
      "metadata": {
        "spec_node_id": "end-node"
      }
    }
  ]
}
k
interesting, is_dynamic is true, but not is_parent
this isn't cached right?
a
i don't think so
it has execution id in the url
so sounds like a backend issue (the response doesn't have is_parent)
and this is what i get while it's running:
Copy code
{
  "node_executions": [
    {
      "id": {
        "node_id": "start-node",
        "execution_id": {
          "project": "avexampleworkflows",
          "domain": "dev",
          "name": "a2n7dnlw76stjlpzqhq6"
        }
      },
      "closure": {
        "output_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avexampleworkflows-dev-a2n7dnlw76stjlpzqhq6/start-node/data/0/outputs.pb>",
        "phase": "SUCCEEDED",
        "created_at": "2022-06-23T23:56:16.891813217Z",
        "updated_at": "2022-06-23T23:56:16.891813217Z"
      },
      "metadata": {
        "spec_node_id": "start-node"
      }
    },
    {
      "id": {
        "node_id": "n0",
        "execution_id": {
          "project": "avexampleworkflows",
          "domain": "dev",
          "name": "a2n7dnlw76stjlpzqhq6"
        }
      },
      "input_uri": "<s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avexampleworkflows-dev-a2n7dnlw76stjlpzqhq6/n0/data/inputs.pb>",
      "closure": {
        "phase": "RUNNING",
        "started_at": "2022-06-23T23:56:17.014704942Z",
        "created_at": "2022-06-23T23:56:16.951710847Z",
        "updated_at": "2022-06-23T23:56:17.014704942Z"
      },
      "metadata": {
        "is_parent_node": true,
        "spec_node_id": "n0"
      }
    }
  ]
}
👀 1
on completion is_parent_node is replaced with is_dynamic
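(for comparison, something like this hypothetical script makes the missing flag easy to spot; the endpoint path and host are assumptions based on what the console calls, so adjust for your deployment and auth)
Copy code
import requests

# Assumed flyteadmin endpoint; replace with your deployment's admin URL.
ADMIN = "http://localhost:30080"

def find_suspect_nodes(project: str, domain: str, execution_name: str) -> None:
    """Flag node executions whose metadata has is_dynamic but not is_parent_node."""
    url = f"{ADMIN}/api/v1/node_executions/{project}/{domain}/{execution_name}"
    payload = requests.get(url).json()
    for ne in payload.get("node_executions", []):
        meta = ne.get("metadata", {})
        if meta.get("is_dynamic") and not meta.get("is_parent_node"):
            print(f"{ne['id']['node_id']}: is_dynamic set but is_parent_node missing")

find_suspect_nodes("avexampleworkflows", "dev", "alkhkhqw6qlxgrqq2lv9")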
k
https://flyte-org.slack.com/archives/CNMKCU6FR/p1656028456973969?thread_ts=1655414807.481039&cid=CNMKCU6FR this isn't an indication of being cached, are there any icons like ( ) that appear in the console?
a
no icons on the console. I also ran new versions of the wf several times, with same consistent results
hi @katrina and @Haytham Abuelfutuh, happy Monday. Any thoughts on how we proceed from here?
h
Hey Alex, let me sync up with Katrina and follow up
👍 1
Hey @Alex Pozimenko Do you mind if we give you docker images for propeller and admin with additional logging to look into what’s going on?
k
for flyteadmin, can you update your deployment to use this image: ghcr.io/flyteorg/flyteadmin:v1.1.26-node-exec-logging
h
And this
ghcr.io/flyteorg/flytepropeller:v1.1.15-patch1
for flytepropeller
a
thanks, will look into this. I actually haven't tried to repro this issue in our dev/scratch environment, so perhaps that's what I should try next 🙂
j
@eugene jahn
a
hey folks, sorry about the delay. I was able to repro the same issue in our dev environment. Next will switch the images as requested above. Is there anything specific I should be looking for?
h
If you can capture the logs from propeller and admin, that would be great. Happy to jump on a call to observe if you want
a
noticed this err in propeller log:
Copy code
{"json":{"exec_id":"a2xflpcg2x5zfpkfnlrk","ns":"dev","routine":"worker-8"},"level":"error","msg":"Failed to update workflow. Error [Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> \"a2xflpcg2x5zfpkfnlrk\": the object has been modified; please apply your changes to the latest version and try again]","ts":"2022-06-29T20:29:00Z"}
E0629 20:29:00.729217       1 workers.go:102] error syncing 'dev/a2xflpcg2x5zfpkfnlrk': Operation cannot be fulfilled on <http://flyteworkflows.flyte.lyft.com|flyteworkflows.flyte.lyft.com> "a2xflpcg2x5zfpkfnlrk": the object has been modified; please apply your changes to the latest version and try again
i see the same err in other environments too
h
cc @Yee
@Alex Pozimenko that isn’t a problem per se. We’re fixing it though but I think that’s independent… digging into the logs
a
sg
h
I think propeller’s logs are either cut off or log level is set to warnings only
👀 1
a
i don't see explicit log level in the pod spec.
so it's whatever the default for the container
h
if you add this to the config map:
Copy code
logger:
  level: 6
  show-source: true
You might see you already have “logger” there…
a
got it, just need to find where it is defined
there's no env var override?
h
You can also set $LOGGER_LEVEL=6 I believe (haven’t played with that in a while though)
a
no logger in the configmap
ok, so the updated config map should look like this?
Copy code
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyte-propeller-config
  namespace: "prod"
data:
  propeller: |-
    propeller:
      logger:
        level: 6
        show-source: true
      kube-client-config:
        qps: 100
        burst: 50
        timeout: 30s
      rawoutput-prefix: "s3://${flyte_bucket_name}"
....
@Haytham Abuelfutuh
❤️ 1
h
oh
that looks the same 😞
a
yeah, i was going to say that too... does the configmap look right?
let me try the env var
h
oh no
it shouldn’t be under propeller:
Copy code
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyte-propeller-config
  namespace: "prod"
data:
  propeller: |-
    logger:
      level: 6
      show-source: true
    propeller:
      kube-client-config:
        qps: 100
        burst: 50
        timeout: 30s
      rawoutput-prefix: "s3://${flyte_bucket_name}"
....
👍 1
a
this looks right
h
Thanks @Alex Pozimenko, I think that clarified things quite a bit. I believe I know what’s causing what you are seeing; I’m still tracking down why it lands in that state though
🙏 1
Hey @Alex Pozimenko, we think we have a fix. As a workaround for now, can you modify flyteadmin’s config to match this line? https://github.com/flyteorg/flyte/blob/d60da1662f3dfc616b1fdb72a323887399da1cb0/charts/flyte-core/values.yaml#L489
Copy code
flyteadmin:
  eventVersion: 2
k
alternatively you can use ghcr.io/flyteorg/flyteadmin:v1.1.26-node-exec-event-version which hard codes the event version
a
on it
trying the config change now
the workaround worked
shall i apply the same in prod?
k
awesome, yes please do!
a
are there any side effects or anything we should keep an eye on after making the change?
k
not really. the change only affects caching of the dynamic workflow closure, which is produced dynamically at run time and is solely a performance optimization. the version bump has been out in our and other users' deployments for a long time and won't affect any in-progress workflows
a
ok, thanks