SeungTaeKim
Nischel Kandru (Woven Planet)
L godlike
/etc/slurm/slurm.conf
NodeName=localhost Gres=gpu:1 CPUs=4 RealMemory=15006 Sockets=1 CoresPerSocket=2 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
/etc/slurm/gres.conf
AutoDetect=nvml
NodeName=localhost Name=gpu Type=tesla File=/dev/nvidia0 COREs=0
slurmd -C
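For reference, AutoDetect=nvml in gres.conf relies on Slurm's NVML integration, so it can be worth confirming that NVML itself sees the device before restarting slurmd. A minimal sanity-check sketch, assuming the nvidia-ml-py (pynvml) package is installed (an assumption, not something from this thread):

import pynvml

# Enumerate the GPUs that NVML reports; AutoDetect=nvml in gres.conf
# depends on this same library finding the device.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    minor = pynvml.nvmlDeviceGetMinorNumber(handle)  # maps to /dev/nvidia<minor>
    print(f"GPU {i}: {name} -> /dev/nvidia{minor}")
pynvml.nvmlShutdown()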
Cody Scandore
pyflyte
Failed with Exception Code: SYSTEM:Unknown
RPC Failed, with Status: StatusCode.INTERNAL
	details: failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials
	caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
		status code: 403, request id: 5efc9c88-fdcb-42ab-bea8-8de7a79101e9
Debug string UNKNOWN:Error received from peer {grpc_message:"failed to create a signed url. Error: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 5efc9c88-fdcb-42ab-bea8-8de7a79101e9", grpc_status:13, created_time:"2023-06-14T11:16:05.730571-07:00"}
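A 403 on sts:AssumeRoleWithWebIdentity generally means the IAM role's trust policy does not trust the cluster's OIDC provider or the pod's service account. One way to narrow it down is to replay the same call from inside the affected pod; a diagnostic sketch, assuming the standard IRSA token path and the AWS_ROLE_ARN variable injected by the EKS webhook (both assumptions, not taken from this thread):

import os
import boto3

# The projected service-account token that IRSA-based credentials use.
TOKEN_PATH = "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"

with open(TOKEN_PATH) as f:
    web_identity_token = f.read()

sts = boto3.client("sts")
# If this raises AccessDenied, the role's trust policy is rejecting the
# token, matching the 403 in the error above.
resp = sts.assume_role_with_web_identity(
    RoleArn=os.environ["AWS_ROLE_ARN"],  # role annotated on the service account
    RoleSessionName="irsa-debug",
    WebIdentityToken=web_identity_token,
)
print(resp["AssumedRoleUser"]["Arn"])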
Rezwan Abir
Nan Qin
Aleksander Lempinen
An error occurred while calling o125.parquet. : java.nio.file.AccessDeniedException: s3://<bucket>/<path>: getFileStatus on s3://<bucket>/<path>: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden;
spark-config-default:
  # We override the default credentials chain provider for Hadoop so that
  # it can use the serviceAccount based IAM role or ec2 metadata based.
  # This is more in line with how AWS works
  - spark.hadoop.fs.s3a.aws.credentials.provider: "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"
  - spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: "2"
  - spark.kubernetes.allocation.batch.size: "50"
  - spark.hadoop.fs.s3a.acl.default: "BucketOwnerFullControl"
  - spark.hadoop.fs.s3n.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3n.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3a.impl: "org.apache.hadoop.fs.s3a.S3AFileSystem"
  - spark.hadoop.fs.AbstractFileSystem.s3a.impl: "org.apache.hadoop.fs.s3a.S3A"
  - spark.hadoop.fs.s3a.multipart.threshold: "536870912"
  - spark.blacklist.enabled: "true"
  - spark.blacklist.timeout: "5m"
  - spark.task.maxfailures: "8"
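For context, a Spark task submitted through Flyte inherits these platform defaults and can override entries per task. A minimal sketch, assuming the flytekitplugins-spark plugin is installed and using a placeholder S3 path (neither comes from this thread):

import flytekit
from flytekit import task
from flytekitplugins.spark import Spark

@task(
    task_config=Spark(
        # Per-task entries merge with spark-config-default; this provider
        # lets the driver and executors use the service-account IAM role.
        spark_conf={
            "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
        }
    )
)
def count_rows(path: str) -> int:
    sess = flytekit.current_context().spark_session
    # The 403 above surfaces here when the pod's role lacks s3:GetObject
    # on the bucket.
    return sess.read.parquet(path).count()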
Anna Cunningham
FlyteRemote.sync
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.RESOURCE_EXHAUSTED
	details = "Received message larger than max (4762259 vs. 4194304)"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:192.168.3.75:81 {grpc_message:"Received message larger than max (4762259 vs. 4194304)", grpc_status:8, created_time:"2022-08-23T23:23:18.247266112+00:00"}"
>
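The RESOURCE_EXHAUSTED status is the client hitting gRPC's default 4 MiB receive cap (4194304 bytes) while FlyteRemote.sync pulls back a large execution closure. In plain grpc-python that cap is raised through channel options; a generic sketch of the option (the endpoint is a placeholder, and this is not a flytekit-specific knob):

import grpc

MAX_MSG_BYTES = 64 * 1024 * 1024  # 64 MiB, comfortably above the 4 MiB default

# Raising both directions on the channel lifts the "larger than max" error
# for oversized responses.
channel = grpc.insecure_channel(
    "flyteadmin.example.com:81",
    options=[
        ("grpc.max_receive_message_length", MAX_MSG_BYTES),
        ("grpc.max_send_message_length", MAX_MSG_BYTES),
    ],
)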
Katrina P
Nicholas Roberson
flytekit.remote
Flyte enables production-grade orchestration for machine learning workflows and data processing pipelines, and was created to accelerate taking workflows from local development to production.
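A minimal sketch of the flytekit.remote flow referenced above; the endpoint, project, domain, and execution name are all placeholders:

from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

remote = FlyteRemote(
    config=Config.for_endpoint(endpoint="flyte.example.com"),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch a past execution by name, then sync to pull its latest state;
# sync_nodes=True also hydrates node-level inputs and outputs.
execution = remote.fetch_execution(name="f8c1a2b3d4e5f6071")
execution = remote.sync(execution, sync_nodes=True)
print(execution.outputs)  # available once the execution has completed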