Brian Tang

    4 months ago
    👋 I’m seeing a discrepancy in Flyte datacatalog between the docs and my testing:
    • The docs say the cache is only invalidated if the cache_version, task signature, or inputs change. This aligns with the example provided:
    “In the above example, calling square(n=2) twice (even if it’s across different executions or different workflows) will only execute the multiplication operation once.”
    • However, in my testing, caching a task and then making a change to a different task in the workflow causes the “cached” task to be recomputed. The code shows the hashed key contains a core.Identifier that has the task version in it. The logs show the same, which would mean each iteration on the workflow would recompute the “cached” task:
    "Successfully cached results to catalog - Task [resource_type:TASK project:\"fraud-intelligence\" domain:\"development\" name:\"src.python.flyte.fraud_intelligence.training_set.main.filter_and_label\" version:\"e8758db3b7a1e6fffbb0bb73c742310acd7e774f\" ]"
    Are the docs just outdated? If the implementation is correct, how are cached tasks expected to be reused when iterating on a workflow? Is the only way to reuse a cached task through a reference task?
    Yee

    4 months ago
    what’s your task signature? could you copy/paste it here?
    it shouldn’t take into account task version… that should be okay to change frequently
    Brian Tang

    4 months ago
    @task(cache=True, cache_version="v1")
    def filter_and_label(
        user: str,
        snapshot_date: str,
        start_date: str,
        end_date: str,
        label_type: str,
        final_features_path: str,
        output_path: str,
    ) -> str:
    Yee

    4 months ago
    if you re-run a workflow at the same version, does the cache get read?
    Brian Tang

    4 months ago
    yep
    The first one is a rerun of the same workflow, the 2nd is a change in a different task that is causing the first “cached” task to be recomputed
    Yee

    4 months ago
    and if you go to the task json definition, the discoveryVersion is what you expect in all cases?
    Brian Tang

    4 months ago
    $ flytectl get tasks -d development -p fraud-intelligence src.python.flyte.fraud_intelligence.training_set.main.filter_and_label -oyaml
    - closure:
        compiledTask:
          template:
    ...
            metadata:
              discoverable: true
              discoveryVersion: v1
    yeah that task hasn’t changed — so that metadata should be consistent
    Yee

    4 months ago
    oh… sorry, the json is also on the Task tab in the UI.
    Brian Tang

    4 months ago
    I traced the logs and the issue is the cache key is different between the runs. The original run has a key of
    flyte_cached-goqzg39XfX_GSwutxjbTzJghG38yCEerd52cCCV6zzA
    and running an updated workflow is showing
    "DataCatalog failed to get artifact by tag flyte_cached-2zsk_u8ljbfKgGhocfVu4Cmm6aHxU8YfO63yFx92duk"
    Yee

    4 months ago
    where did you see the core.Identifier you were talking about?
    Brian Tang

    4 months ago
    it’s in the github code link I provided above: catalog.Key
    Ah yes. that is much easier to see the task metadata 🙂
      "type": "python-task",
      "metadata": {
        "discoverable": true,
        "runtime": {
          "type": 1,
          "version": "0.26.0",
          "flavor": "python"
        },
        "retries": {},
        "discoveryVersion": "v1"
      },
    looks consistent between both of them
    Dan Rammer (hamersaw)

    4 months ago
    Hey @Brian Tang, it looks like the tag where the hash differs is computed as a hash over the task's input values, so task version shouldn't be the issue if the tags are different. Can you show the Inputs tab in the UI from the different task runs?
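To make the point about hashing input values concrete, here is a minimal sketch of a tag derived only from a task's inputs. This is not Flyte's actual implementation: the `cache_tag` helper, the JSON serialization, and the choice of SHA-256 are assumptions for illustration, and the input values are made up. Only the `flyte_cached-` prefix comes from the logs in this thread.

```python
import base64
import hashlib
import json


def cache_tag(inputs: dict) -> str:
    """Hypothetical helper: derive a tag from input values only."""
    # Serialize with sorted keys so the same inputs always hash the same way.
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # URL-safe base64 without padding, similar in shape to the tags in the logs.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return f"flyte_cached-{encoded}"


# Illustrative inputs; the second run differs only in one date.
run_1 = {"user": "btang", "start_date": "2022-04-25", "end_date": "2022-04-25"}
run_2 = {"user": "btang", "start_date": "2022-04-25", "end_date": "2022-04-26"}

# Identical inputs give the same tag; the task version is not part of it.
same_tag = cache_tag(run_1) == cache_tag(dict(run_1))
# A single changed date yields a different tag, which shows up as a cache miss.
different_tag = cache_tag(run_1) != cache_tag(run_2)
```

Under these assumptions, a changed tag between two runs points at a changed input value rather than a changed task version.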
    Brian Tang

    4 months ago
    oh crud —
    end_date: 2022-04-26
    snapshot_date: 2022-04-26
    start_date: 2022-04-25
    user: btang
    i forgot one of the inputs was a date that was computed from the current date
    lemme try it again
    😬 yep my mistake. The cache key is consistent!
    thanks for the help!
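For anyone hitting the same thing: an input computed from the current date changes from day to day, so the cache key changes with it. Below is a minimal sketch of the pitfall in plain Python, with a dictionary standing in for the catalog. The `run_cached` helper, `fake_catalog`, and the simplified one-argument `filter_and_label` are hypothetical and are not flytekit APIs.

```python
from datetime import date, timedelta

fake_catalog = {}  # hypothetical stand-in for the catalog
executions = []    # records each time the task body actually runs


def filter_and_label(snapshot_date: str) -> str:
    executions.append(snapshot_date)
    return f"labels-{snapshot_date}"


def run_cached(snapshot_date: str) -> str:
    """Look up the result by input value; compute it only on a miss."""
    key = ("filter_and_label", "v1", snapshot_date)
    if key not in fake_catalog:
        fake_catalog[key] = filter_and_label(snapshot_date)
    return fake_catalog[key]


today = date(2022, 4, 26)  # fixed here so the sketch is deterministic

# Pitfall: a date derived from "now" differs between days, so each day misses.
run_cached(str(today - timedelta(days=1)))
run_cached(str(today))

# Passing the date as an explicit, pinned input reuses the cached result.
run_cached("2022-04-26")
```

The third call does not execute the task body again, because its input value matches an earlier run.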
    Dan Rammer (hamersaw)

    4 months ago
    No problem! thanks @Yee! Looks like you took a pretty deep dive, hope it wasn't too painful 😅
    Ketan (kumare3)

    4 months ago
    @Brian Tang help us understand what ux can make this process better?
    cc @Hongxun / @Jason Porter
    Brian Tang

    4 months ago
    This was mostly user error on my part 😅, but perhaps it would be helpful if the logs also included the input names/types that are part of the cache key. Reading
    "DataCatalog failed to get dataset for ID resource_type:TASK project:\"fraud-intelligence\" domain:\"development\" name:\"src.python.flyte.fraud_intelligence.training_set.main.filter_and_label\" version:\"e8758db3b7a1e6fffbb0bb73c742310acd7e774f\"
    at first glance made me believe those were all a part of the cache key
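A rough sketch of what that suggestion could look like: a message that names the inputs contributing to the cache key. The `describe_cache_miss` function and its message format are hypothetical, meant only to illustrate the idea. They are not Flyte's log format or API.

```python
def describe_cache_miss(task_name: str, cache_version: str, inputs: dict) -> str:
    """Hypothetical formatter: list the input names and types behind a key."""
    rendered = ", ".join(
        f"{name}:{type(value).__name__}" for name, value in sorted(inputs.items())
    )
    return (
        f"cache miss for task {task_name} "
        f"(cache_version={cache_version}); key inputs: [{rendered}]"
    )


message = describe_cache_miss(
    "filter_and_label",
    "v1",
    {"user": "btang", "snapshot_date": "2022-04-26"},
)
```

Listing the input names next to the cache version would make it clearer that the task version in the dataset ID is not what decides a hit or miss.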