# flytekit
a
@katrina @Ketan (kumare3) I am having more serious blocker problems with k8s service accounts. I'm on flytekit and flyteidl 1.0.1, propeller 1.1.3 and flyteadmin 1.1.4. Running from the console or with `flytectl create execution` is fine, but now scheduled launch plans ignore the registered k8s service account and run as `default`. My scheduled launch plan looks ok:
FA21110289:avexampleworkflows alex.bain$ cat lp.json 
{
	"id": {
		"resourceType": "LAUNCH_PLAN",
		"project": "avexampleworkflows",
		"domain": "dev",
		"name": "schedule_fabrik_parent_workflow_101",
		"version": "31599fdeaaf34476ba54155426ae0707709300bb"
	},
	"spec": {
		"workflowId": {
			"resourceType": "WORKFLOW",
			"project": "avexampleworkflows",
			"domain": "dev",
			"name": "app.workflows.fabrik.fabrik_parent_workflow.fabrik_parent_workflow",
			"version": "31599fdeaaf34476ba54155426ae0707709300bb"
		},
		"entityMetadata": {
			"schedule": {
				"cronExpression": "*/15 * * * ? *"
			}
		},
		"defaultInputs": {},
		"fixedInputs": {},
		"labels": {},
		"annotations": {},
		"authRole": {
			"kubernetesServiceAccount": "avexampleworkflows"
		},
		"rawOutputDataConfig": {
			"outputLocationPrefix": "s3://lyft-av-prod-pdx-flyte/raw_data"
		}
	},
	"closure": {
		"state": "ACTIVE",
		"expectedInputs": {},
		"expectedOutputs": {},
		"createdAt": "2022-06-04T03:01:52.526120Z",
		"updatedAt": "2022-06-04T03:01:52.526120Z"
	}
}
I kind of thought there would be a `securityContext` entry in the scheduled launch plan, but I don't see one. Anyways, I created this ^^^^^^ with:
schedule_fabrik_parent_workflow_101 = flytekit.LaunchPlan.get_or_create(
    name="schedule_fabrik_parent_workflow_101",
    schedule=flytekit.CronSchedule(cron_expression="*/15 * * * ? *"),
    workflow=fabrik_parent_workflow,
)
All of this was registered with `flytectl register files --k8sServiceAccount avexampleworkflows`. Then I did `flytectl update launchplan --admin.endpoint avflyteadmin.scratch-alexbain.dev.l5.woven-planet.tech:443 -p avexampleworkflows -d dev schedule_fabrik_parent_workflow_101 --version 31599fdeaaf34476ba54155426ae0707709300bb --activate` to activate this launch plan (so that it would start executing every 15 minutes). Ah, here is the execution. You can see that the `securityContext` is broken in the execution. This is definitely a bug then, as it ignored the declared `kubernetesServiceAccount` on the actual launch plan. I ran `flytectl get execution --admin.endpoint avflyteadmin.scratch-alexbain.dev.l5.woven-planet.tech:443 -p avexampleworkflows -d dev at89nsj2879cfl6l2cbf -o json`:
{
  "id": {
    "project": "avexampleworkflows",
    "domain": "dev",
    "name": "at89nsj2879cfl6l2cbf"
  },
  "spec": {
    "launchPlan": {
      "resourceType": "LAUNCH_PLAN",
      "project": "avexampleworkflows",
      "domain": "dev",
      "name": "schedule_fabrik_parent_workflow_101",
      "version": "31599fdeaaf34476ba54155426ae0707709300bb"
    },
    "metadata": {
      "mode": "SCHEDULED",
      "scheduledAt": "2022-06-04T03:15:00Z",
      "systemMetadata": {}
    },
    "securityContext": {
      "runAs": {
        "k8sServiceAccount": "default"
      }
    }
  },
  "closure": {
    "error": {
      "code": "RetriesExhausted|UnknownError",
      "message": "[1/1] currentAttempt done. Last Error: USER::Pod failed. No message received from kubernetes.\r\n[at89nsj2879cfl6l2cbf-n0-0-n0-0] terminated with exit code (2). Reason [Error]. Message: \nation: '\n+ echo 'L5 data center: pdx'\n+ echo 'L5 domain: dev.l5.woven-planet.tech'\n+ echo 'L5 cluster name: scratch-alexbain'\nIn the avcontainers flyte-spark-entrypoint script\nL5 application: \nL5 data center: pdx\nL5 domain: dev.l5.woven-planet.tech\nL5 cluster name: scratch-alexbain\nL5 namespace: dev\n[INFO] Vault address: https://vault.pdx.dev.l5.woven-planet.tech\n[INFO] Flyte internal domain: dev\n[INFO] Flyte internal project: avexampleworkflows\n[INFO] Flyte internal execution project: avexampleworkflows\n+ echo 'L5 namespace: dev'\n+ export VAULT_ADDR=https://vault.pdx.dev.l5.woven-planet.tech\n+ VAULT_ADDR=https://vault.pdx.dev.l5.woven-planet.tech\n+ echo '[INFO] Vault address: https://vault.pdx.dev.l5.woven-planet.tech'\n+ echo '[INFO] Flyte internal domain: dev'\n+ echo '[INFO] Flyte internal project: avexampleworkflows'\n+ echo '[INFO] Flyte internal execution project: avexampleworkflows'\n+ role=avexampleworkflows-dev\n+ echo '[INFO] Role: avexampleworkflows-dev'\n+ SERVICE_ACCOUNT_TOKEN_PATH=/var/run/secrets/kubernetes.io/serviceaccount/token\n+ [[ ehxBE =~ x ]]\n+ debug=1\n+ set +x\n[INFO] Role: avexampleworkflows-dev\n+ AWS_SECRET_PREFIX=level5\n+ PROD_ENV_REGEX='[a-zA-Z]*prod[a-zA-Z]*'\n+ SCRATCH_ENV_REGEX='scratch[a-zA-Z]*'\n+ SECRET_PREFIX=level5\n+ [[ scratch-alexbain =~ scratch[a-zA-Z]* ]]\n+ AWS_SECRET_PREFIX=level5/scratch-alexbain\n+ SECRET_PREFIX=level5/scratch-alexbain\n+ [[ avexampleworkflows-dev =~ [a-zA-Z]*prod[a-zA-Z]* ]]\n+ AWS_SECRET_PREFIX=level5/scratch-alexbain/dev/aws/sts\n+ SECRET_PREFIX=level5/scratch-alexbain/dev/flyte/sts\n+ echo '[INFO] AWS secret prefix: level5/scratch-alexbain/dev/aws/sts'\n+ echo '[INFO] Secret prefix: level5/scratch-alexbain/dev/flyte/sts'\n+ set +x\n[INFO] AWS secret prefix: level5/scratch-alexbain/dev/aws/sts\n[INFO] Secret prefix: level5/scratch-alexbain/dev/flyte/sts\nError writing data to auth/scratch-alexbain/login: Error making API request.\n\nURL: PUT https://vault.pdx.dev.l5.woven-planet.tech/v1/auth/scratch-alexbain/login\nCode: 500. Errors:\n\n* service account name not authorized\n.",
      "kind": "USER"
    },
    "phase": "FAILED",
    "startedAt": "2022-06-04T03:15:13.228779416Z",
    "duration": "49.957632481s",
    "createdAt": "2022-06-04T03:15:12.947090574Z",
    "updatedAt": "2022-06-04T03:16:03.186411481Z",
    "workflowId": {
      "resourceType": "WORKFLOW",
      "project": "avexampleworkflows",
      "domain": "dev",
      "name": "app.workflows.fabrik.fabrik_parent_workflow.fabrik_parent_workflow",
      "version": "31599fdeaaf34476ba54155426ae0707709300bb"
    },
    "stateChangeDetails": {
      "occurredAt": "2022-06-04T03:15:12.947090574Z"
    }
  }
}
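For anyone wanting to check for the same symptom, here is a minimal Python sketch that compares the service account declared on a launch plan with the one an execution actually ran as. It assumes JSON shaped like the two dumps above; the function names are hypothetical helpers and not part of flytekit or flytectl.

```python
from typing import Optional, Tuple


def declared_service_account(launch_plan: dict) -> Optional[str]:
    """Service account declared on the launch plan spec, if any."""
    spec = launch_plan.get("spec") or {}
    run_as = (spec.get("securityContext") or {}).get("runAs") or {}
    auth_role = spec.get("authRole") or {}
    # Prefer the newer securityContext field, then the authRole field
    # that lp.json above actually carries.
    return run_as.get("k8sServiceAccount") or auth_role.get("kubernetesServiceAccount")


def execution_service_account(execution: dict) -> Optional[str]:
    """Service account recorded in the execution's securityContext, if any."""
    spec = execution.get("spec") or {}
    run_as = (spec.get("securityContext") or {}).get("runAs") or {}
    return run_as.get("k8sServiceAccount")


def service_account_mismatch(launch_plan: dict, execution: dict) -> Optional[Tuple[str, Optional[str]]]:
    """Return (declared, actual) when they differ, otherwise None."""
    declared = declared_service_account(launch_plan)
    actual = execution_service_account(execution)
    if declared and declared != actual:
        return (declared, actual)
    return None


# Trimmed copies of the two documents above.
launch_plan = {"spec": {"authRole": {"kubernetesServiceAccount": "avexampleworkflows"}}}
execution = {"spec": {"securityContext": {"runAs": {"k8sServiceAccount": "default"}}}}

print(service_account_mismatch(launch_plan, execution))
# -> ('avexampleworkflows', 'default')
```

In practice the two dicts would come from `json.load` on saved command output rather than inline literals.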
k
Cc @Prafulla Mahindrakar, can you please look into this? This should not happen.
Praful, why does flytectl still use `authRole` instead of the security context?
a
Lol praying we can solve this with a `flytectl` fix... that's actually the one easy update we can make org wide!
k
Ohh ya it is flytectl or FlyteAdmin
It's nothing to do with flytekit
a
Actually the only one that is a massive blocker is a flytekit update
Ketan and @Prafulla Mahindrakar, could you Slack me whenever you have a fix? Would love to get it before the end of next week (although I don't need it Mon/Tue as I am OOTO then).
k
That we absolutely understand; the design is to ensure that flytekit is the slowest to need an upgrade. I know when you upgraded from 0.26 to 1.0 it failed. It was a big upgrade.
Cc @Alex Bain ideally this should be the security context, though the assumable IAM role should also work: https://github.com/flyteorg/flytectl/blob/bff8cf193fd6b7dd8c6fe3ee28aeb5e241a31f0e/cmd/register/register_util.go#L403
What do you prefer - an admin fix or a flytectl fix?
p
I have created a fix for flytectl here: https://github.com/flyteorg/flytectl/pull/328
Will take a look at the admin issue too. Will also check whether there are other places where we are using deprecated fields.
Found the issue. Will send out the fix for flyteadmin.
k
But the flytectl fix should unblock Alex for now, right?
And admin will fix things retroactively.
p
Yes, that's right.
Hi @Alex Bain, the introduction of default values and deep value checks was causing some of the backward-compat code to not get executed. There were also a few redundant attributes that were causing some confusion. I have fixed the issue in the following image: ghcr.io/flyteorg/flyteadmin:backward-compat-fix Here's the PR for it: https://github.com/flyteorg/flyteadmin/pull/439/files It would be great if you could help verify it for your workflow.
Also, this issue is not specific to scheduled workflows and can happen on regular workflows too. (Flyteconsole and `flytectl create execution` handled the backward compatibility by sending the execution parameters in the new security context field.) A scheduled workflow is not an immediate execution, so it relies on the spec to provide the right execution parameters, which got fixed by the flytectl change here: https://github.com/flyteorg/flytectl/pull/328
The flyteadmin fix additionally handles cases where older workflows, or (as in your case) launch plans, have only `authRole` set and need to be launched without fetching the spec and passing it in, e.g. what the scheduler executor does: https://github.com/flyteorg/flyteadmin/blob/master/scheduler/executor/executor_impl.go#L69-L91 It expects all service-account-related settings to be present in the launch plan and not explicitly passed when calling from the scheduler.
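As a rough illustration of the backward-compatibility behavior described above (this is not the actual flyteadmin code, which is written in Go), the following Python sketch fills a launch plan spec's security context from the deprecated `authRole` when the newer field is empty. The function name is hypothetical, and the `iamRole` / `assumableIamRole` keys are assumed from the flyteidl field names; they do not appear in the dumps in this thread.

```python
import copy


def backfill_security_context(spec: dict) -> dict:
    """Return a copy of a launch plan spec with securityContext.runAs
    filled from the deprecated authRole wherever the new field is empty."""
    result = copy.deepcopy(spec)
    auth_role = result.get("authRole") or {}
    security_context = dict(result.get("securityContext") or {})
    run_as = dict(security_context.get("runAs") or {})

    # Values already present in securityContext win; authRole only fills gaps.
    if not run_as.get("k8sServiceAccount") and auth_role.get("kubernetesServiceAccount"):
        run_as["k8sServiceAccount"] = auth_role["kubernetesServiceAccount"]
    # Assumed field names for the IAM role case mentioned earlier in the thread.
    if not run_as.get("iamRole") and auth_role.get("assumableIamRole"):
        run_as["iamRole"] = auth_role["assumableIamRole"]

    # Leave the spec untouched when there is nothing to carry over.
    if run_as:
        security_context["runAs"] = run_as
        result["securityContext"] = security_context
    return result


# A spec like the one in lp.json above: only authRole is set.
legacy_spec = {"authRole": {"kubernetesServiceAccount": "avexampleworkflows"}}
print(backfill_security_context(legacy_spec)["securityContext"])
# -> {'runAs': {'k8sServiceAccount': 'avexampleworkflows'}}
```

The sketch copies the spec instead of mutating it so the registered launch plan stays as submitted; whether a value such as `default` should count as "empty" is a policy decision this example does not make.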