Nicholas LoFaso
08/31/2022, 1:33 AM
Failed to record taskEvent, error [EventAlreadyInTerminalStateError: conflicting events; destination: ABORTED, caused by [rpc error: code = FailedPrecondition desc = invalid phase change from FAILED to ABORTED for task execution
I feel like it is related to this ticket. My main question, however, is:
When something like this happens, what is the best way to reset flytepropeller? I tried restarting the pod, but it seems like it is a problem with the database state?
Shahwar Saleem
08/31/2022, 6:28 PM
helm install
?
Error: file '/Users/shahwar.saleem/Library/Caches/helm/repository/flyte-core-v1.1.0.tgz' does not appear to be a gzipped archive; got 'application/octet-stream'
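A minimal sketch for inspecting the cached file named in that error. It only checks whether the file starts with the two-byte gzip magic number, which is what a real `.tgz` chart archive would begin with. The helper names `looks_like_gzip` and `describe_chart_archive` are hypothetical and not part of Helm; the thread does not confirm why the file is not a gzip archive.

```python
from pathlib import Path

# Every gzip stream begins with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def describe_chart_archive(path):
    """Classify a cached chart file as 'missing', 'empty', 'gzip' or 'not gzip'.

    Hypothetical helper for illustration only.
    """
    p = Path(path)
    if not p.exists():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    return "gzip" if looks_like_gzip(p) else "not gzip"
```

Calling `describe_chart_archive(...)` with the path from the error message would show whether the cached download is a genuine archive or something else (for example an error page saved under the `.tgz` name).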
KS Tarun
09/01/2022, 5:33 PM
Felix Ruess
09/01/2022, 8:57 PM
Smriti Satyan
09/02/2022, 10:16 AM
Shivay Lamba
09/05/2022, 6:21 AM
KS Tarun
09/05/2022, 12:42 PM
Geoff Salmon
09/08/2022, 5:39 PM
Justin Tyberg
09/09/2022, 2:50 PM
flytectl register from a CI workflow, with Flyte auth enabled.
We’re on GCP/GKE.
Ideally, I’d like to register files from a pod running on the GKE cluster. I think this would use the OAuth2 flow that flyte components use?
Alternatively, we could run flytectl register from the CI runner, as this doc implies. I’d love to find an example of doing this.
John Lawlor
09/14/2022, 4:42 PM
flytectl sandbox start --source . --imagePullPolicy IfNotPresent
flytectl sandbox exec -- docker build -t new_image .
When I run
flytectl sandbox exec -- docker images
I see my image inside of the flyte-sandbox container.
REPOSITORY   TAG      IMAGE ID       CREATED          SIZE
new_image    latest   044c6f6eb860   44 minutes ago   1.49GB
Within my @task, I reference new_image. When I go to run the workflow remotely, I receive the following error after about 30 seconds:
[1/1] currentAttempt done. Last Error: USER::containers with unready status: [ffdb7fdd18e544b99a24-n0-0]|Back-off pulling image "new_image:latest"
Any thoughts?
Fredrick
09/14/2022, 10:17 PM
kubectl logs flyte-pod-webhook-57cff5dd75-w596k
Defaulted container "webhook" out of: webhook, generate-secrets (init)
time="2022-09-14T22:11:09Z" level=info msg=------------------------------------------------------------------------
time="2022-09-14T22:11:09Z" level=info msg="App [flytepropeller], Version [unknown], BuildSHA [unknown], BuildTS [2022-09-14 22:11:09.195785843 +0000 UTC m=+0.480189926]"
time="2022-09-14T22:11:09Z" level=info msg=------------------------------------------------------------------------
time="2022-09-14T22:11:09Z" level=info msg="Detected: 64 CPU's\n"
{"json":{},"level":"fatal","msg":"Failed to create controller manager. Error: failed to initialize controller-runtime manager: error listening on :8080: listen tcp :8080: bind: address already in use","ts":"2022-09-14T22:11:13Z"}
Jacob Wang
09/15/2022, 10:06 AM
Armaan Goel
09/17/2022, 4:31 PM
pyflyte run --remote core/flyte_basics/hello_world.py my_wf after setting up a cluster on gcloud, I think it has something to do with my SSL setup being incorrect?
Rahul Mehta
09/18/2022, 11:05 PM
flytectl diff feature (or does a version of this functionality already exist somewhere)? We'd like to implement a custom controller for Flux CD to manage Flyte project/domain/tra/cra config in a declarative manner that's consistent w/ the rest of our gitops infra.
Based on this example, if diff was supported this seems like it'd be fairly straightforward. Curious if anyone else has thoughts/if others are applying gitops patterns to managing Flyte resources
Arshak Ulubabyan
09/20/2022, 1:37 PM
userSettings:
  ...
  logGroup: /aws/eks/analytics-orchestration/flyte
  ...
configmap:
  task_logs:
    plugins:
      logs:
        kubernetes-enabled: false
        # -- One option is to enable cloudwatch logging for EKS, update the region and log group accordingly
        # You can even disable this
        cloudwatch-enabled: true
        # -- region where logs are hosted
        cloudwatch-region: "{{ .Values.userSettings.accountRegion }}"
        # -- cloudwatch log-group
        cloudwatch-log-group: "{{ .Values.userSettings.logGroup }}"
Ken Leidal
09/20/2022, 3:26 PM
gcloud iam service-accounts add-iam-policy-binding --role "roles/iam.workloadIdentityUser" --member "serviceAccount:${PROJECT_ID}.svc.id.goog[flyte/flyteadmin]" gsa-flyteadmin@${PROJECT_ID}.iam.gserviceaccount.com
I’m getting:
ERROR: (gcloud.iam.service-accounts.add-iam-policy-binding) INVALID_ARGUMENT: Identity Pool does not exist (${PROJECT_ID}.svc.id.goog). Please check that you specified a valid resource name as returned in the `name` attribute in the configuration API.
(PROJECT_ID being redacted in my message, but it’s the actual PROJECT_ID in the real log message).
Do I need to create the k8s cluster and enable workload identity pools on it before running this command?
---
PS: reading this thread with @Armaan Goel, yeah it looks like there is a problem with the order in the deployment manual, and I’ll have to launch the GKE cluster first.
Armaan Goel
09/20/2022, 10:44 PM
Sanjay Chouhan
09/22/2022, 6:58 AM
admin:
  # For GRPC endpoints you might want to use dns:///flyte.myexample.com
  endpoint: dns:///a4ad903c61##################.us-west-1.elb.amazonaws.com:80
  authType: Pkce
  insecure: true
logger:
  show-source: true
  level: 0
storage:
  connection:
    access-key: minio
    auth-type: accesskey
    disable-ssl: true
    endpoint: http://a093eb##############.us-west-1.elb.amazonaws.com:9001/
    region: us-east-1
    secret-key: miniostorage
  type: minio
  container: "my-s3-bucket"
  enable-multicontainer: true
Why is it checking localhost when I have specified the AWS ingress URL?
Shahwar Saleem
09/28/2022, 9:09 PM
UNKNOWN state. I was wondering what could be a possible cause for this?
Do we need to inform any service other than Flyte Admin about the auth? For example, Flyte Propeller?
CC: @Prafulla Mahindrakar
Jacob Wang
09/29/2022, 11:52 AM
Andrew Achkar
09/29/2022, 4:17 PM
Hanno Küpers
09/30/2022, 10:35 AM
Fredrick
10/03/2022, 1:53 PM
flytepropeller-7476486c6c-rtkp7 flytepropeller E1002 21:16:23.580394 1 workers.go:102] error syncing 'maps-flyte/a2lmztn5zv6tt26ddk8k': failed at Node[n0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [[deny-capabilities] container <a2lmztn5zv6tt26ddk8k-n0-0> has a denied capability. Denied capabilities are ["DAC_READ_SEARCH", "NET_ADMIN", "SYS_ADMIN", "SYS_MODULE", "SYS_PTRACE", "DAC_OVERRIDE", "FOWNER", "KILL", "MKNOD", "NET_BIND_SERVICE", "NET_RAW", "SETFCAP", "SETGID"]] failed to create resource, caused by: admission webhook "validation.gatekeeper.sh" denied the request: [deny-capabilities] container <a2lmztn5zv6tt26ddk8k-n0-0> has a denied capability. Denied capabilities are ["DAC_READ_SEARCH", "NET_ADMIN", "SYS_ADMIN", "SYS_MODULE", "SYS_PTRACE", "DAC_OVERRIDE", "FOWNER", "KILL", "MKNOD", "NET_BIND_SERVICE", "NET_RAW", "SETFCAP", "SETGID"]
Open AIMP
10/04/2022, 3:32 AM
AttributeError: module 'torch' has no attribute 'nn'
Zachary Kimble
10/04/2022, 8:42 PM
cra.yaml will be applied in k8s via the resourcequota on the project-domain namespace. Is that not the case?
After updating cluster-resources-attribute, I'm seeing the following mismatch:
flytectl get cluster-resource-attribute -p flytesnacks -d development
returns
{
  "project": "flytesnacks",
  "domain": "development",
  "attributes": {
    "projectQuotaCpu": "10",
    "projectQuotaMemory": "14G"
  }
}
but
kubectl -n flytesnacks-development get resourcequotas
returns
NAME            AGE    REQUEST   LIMIT
project-quota   142m             limits.cpu: 0/4, limits.memory: 0/3000Mi
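To make the mismatch concrete: the two memory figures use different Kubernetes quantity suffixes (`G` is a decimal multiplier, `Mi` a binary one), so they are easier to compare once both are in bytes. The following is a rough sketch, not Flyte or Kubernetes code; `quantity_to_bytes` is a hypothetical helper that handles only whole numbers with the standard suffixes.

```python
import re

# Standard Kubernetes quantity suffixes: decimal (k, M, G, ...) and binary (Ki, Mi, Gi, ...).
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}


def quantity_to_bytes(quantity):
    """Convert a whole-number quantity string such as '14G' or '3000Mi' to bytes.

    Hypothetical helper: fractional values and the milli suffix are not handled.
    """
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


# Values copied from the two command outputs above.
admin_memory = quantity_to_bytes("14G")      # projectQuotaMemory from flytectl
quota_memory = quantity_to_bytes("3000Mi")   # limits.memory from kubectl
print(admin_memory, quota_memory, admin_memory == quota_memory)
```

The two values differ by more than a unit conversion, which is consistent with the ResourceQuota not reflecting the updated attribute rather than a formatting difference.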
Dan Rammer (hamersaw)
10/05/2022, 2:39 PM
Ketan (kumare3)
Rahul Mehta
10/07/2022, 6:51 PM
Matheus Moreno
10/10/2022, 4:11 PM
helm repo add flyteorg https://helm.flyte.org and then helm pull --untar flyteorg/flyte-core, I get a chart of version 1.2.0. But on GitHub the latest version is 1.1.0. I think I'm looking in the wrong place. Where are the updated charts? Are they dynamically generated?
By the way, the README of the flyte-core 1.2.0 chart is wrong. For some reason, the README for the flyte-deps chart is being shown.
Ekku Jokinen
10/13/2022, 1:26 PM
$ kubectl -n flyte get ingress
NAME              CLASS    HOSTS   ADDRESS   PORTS   AGE
flyte-core        <none>   *                 80      41m
flyte-core-grpc   <none>   *                 80      41m
This issue was mentioned in the troubleshooting section of the docs, and it suggested running
$ kubectl describe ingress -n flyte
Name:             flyte-core
Labels:           app.kubernetes.io/managed-by=Helm
Namespace:        flyte
Address:
Ingress Class:    <none>
Default backend:  <default>
Rules:
  Host  Path  Backends
  ----  ----  --------
  *
        /*    ssl-redirect:use-annotation (<error: endpoints "ssl-redirect" not found>)
        ...
Annotations:  alb.ingress.kubernetes.io/actions.ssl-redirect:
                {"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}
              alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:752578504353:certificate/7a7065cb-1ffc-418e-8070-bc36fbaff7cb
              alb.ingress.kubernetes.io/group.name: flyte
              alb.ingress.kubernetes.io/listen-ports: [{"HTTP": 80}, {"HTTPS":443}]
              alb.ingress.kubernetes.io/scheme: internet-facing
              alb.ingress.kubernetes.io/tags: service_instance=production
              kubernetes.io/ingress.class: alb
              meta.helm.sh/release-name: flyte
              meta.helm.sh/release-namespace: flyte
              nginx.ingress.kubernetes.io/app-root: /console
Events:
  Type     Reason             Age                   From     Message
  ----     ------             ----                  ----     -------
  Warning  FailedDeployModel  3m33s (x20 over 42m)  ingress  Failed deploy model due to InvalidParameter: 1 validation error(s) found.
           - minimum field value of 1, CreateTargetGroupInput.Port.
Name:             flyte-core-grpc
Labels:           app.kubernetes.io/managed-by=Helm
Namespace:        flyte
Address:
Ingress Class:    <none>
Default backend:  <default>
Rules:
  Host  Path  Backends
  ----  ----  --------
  ...
Annotations:  alb.ingress.kubernetes.io/actions.ssl-redirect:
                {"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}
              ...
Events:
  Type     Reason             Age                   From     Message
  ----     ------             ----                  ----     -------
  Warning  FailedDeployModel  3m36s (x19 over 42m)  ingress  Failed deploy model due to InvalidParameter: 1 validation error(s) found.
           - minimum field value of 1, CreateTargetGroupInput.Port.
Could someone point me in the right direction for debugging this? I checked the security groups of the EKS cluster and RDS, and they were the same. Thanks in advance.
Best,
Ekku, CTO & Co-Founder @ inven.ai