# flyte-support
m
Hello Team! I recently deployed Flyte on a Hetzner Cloud Kubernetes cluster built on k3s. I wanted to run Spark-based tasks; I can see the Spark clusters being initiated, but every Spark task fails with the following error log:
```
++ id -u
+ myuid=1000
++ id -g
+ mygid=1000
+ set +e
++ getent passwd 1000
+ uidentry=flytekit:x:1000:1000::/home/flytekit:/bin/bash
+ set -e
+ '[' -z flytekit:x:1000:1000::/home/flytekit:/bin/bash ']'
+ '[' -z /usr/local/openjdk-11 ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z x ']'
+ export PYSPARK_PYTHON
+ '[' -z x ']'
+ export PYSPARK_DRIVER_PYTHON
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.244.3.55 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///usr/local/bin/entrypoint.py pyflyte-fast-execute --additional-distribution s3://flyte/flytesnacks/development/AFUU2MUU5IGWXVCGNU4USDB6WU======/script_mode.tar.gz --dest-dir . -- pyflyte-execute --inputs s3://flyte/metadata/propeller/flytesnacks-development-asgp25xdskrs4btpqxmt/n0/data/inputs.pb --output-prefix s3://flyte/metadata/propeller/flytesnacks-development-asgp25xdskrs4btpqxmt/n0/data/0 --raw-output-data-prefix s3://flyte/data/eu/asgp25xdskrs4btpqxmt-n0-0 --checkpoint-path s3://flyte/data/eu/asgp25xdskrs4btpqxmt-n0-0/_flytecheckpoints --prev-checkpoint '""' --resolver flytekit.core.python_auto_container.default_task_resolver -- task-module hellospark task-name hello_spark
24/09/01 04:59:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/flytekit/core/data_persistence.py", line 560, in get_data
    self.get(remote_path, to_path=local_path, recursive=is_multipart, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.9/dist-packages/flytekit/core/data_persistence.py", line 122, in retry_request
    raise e
  File "/usr/local/lib/python3.9/dist-packages/flytekit/core/data_persistence.py", line 115, in retry_request
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/flytekit/core/data_persistence.py", line 296, in get
    dst = file_system.get(from_path, to_path, recursive=recursive, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py", line 103, in sync
    raise return_result
  File "/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py", line 56, in _runner
    result[0] = await coro
  File "/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py", line 634, in _get
    rpaths = [
  File "/usr/local/lib/python3.9/dist-packages/fsspec/asyn.py", line 635, in <listcomp>
    p for p in rpaths if not (trailing_sep(p) or await self._isdir(p))
  File "/usr/local/lib/python3.9/dist-packages/s3fs/core.py", line 1483, in _isdir
    return bool(await self._lsdir(path))
  File "/usr/local/lib/python3.9/dist-packages/s3fs/core.py", line 723, in _lsdir
    async for c in self._iterdir(
  File "/usr/local/lib/python3.9/dist-packages/s3fs/core.py", line 755, in _iterdir
    s3 = await self.get_s3(bucket)
  File "/usr/local/lib/python3.9/dist-packages/s3fs/core.py", line 353, in get_s3
    return await self._s3creator.get_bucket_client(bucket)
  File "/usr/local/lib/python3.9/dist-packages/s3fs/utils.py", line 39, in get_bucket_client
    response = await general_client.head_bucket(Bucket=bucket_name)
  File "/usr/local/lib/python3.9/dist-packages/aiobotocore/client.py", line 394, in _make_api_call
    http, parsed_response = await self._make_request(
  File "/usr/local/lib/python3.9/dist-packages/aiobotocore/client.py", line 420, in _make_request
    return await self._endpoint.make_request(
  File "/usr/local/lib/python3.9/dist-packages/aiobotocore/endpoint.py", line 96, in _send_request
    request = await self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.9/dist-packages/aiobotocore/endpoint.py", line 84, in create_request
    await self._event_emitter.emit(
  File "/usr/local/lib/python3.9/dist-packages/aiobotocore/hooks.py", line 66, in _emit
    response = await resolve_awaitable(handler(**kwargs))
  File "/usr/local/lib/python3.9/dist-packages/aiobotocore/_helpers.py", line 6, in resolve_awaitable
    return await obj
  File "/usr/local/lib/python3.9/dist-packages/aiobotocore/signers.py", line 24, in handler
    return await self.sign(operation_name, request)
  File "/usr/local/lib/python3.9/dist-packages/aiobotocore/signers.py", line 90, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.9/dist-packages/botocore/auth.py", line 423, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/entrypoint.py", line 620, in <module>
    _pass_through()
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/bin/entrypoint.py", line 547, in fast_execute_task_cmd
    _download_distribution(additional_distribution, dest_dir)
  File "/usr/local/lib/python3.9/dist-packages/flytekit/core/utils.py", line 308, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/flytekit/tools/fast_registration.py", line 151, in download_distribution
    FlyteContextManager.current_context().file_access.get_data(
  File "/usr/local/lib/python3.9/dist-packages/flytekit/core/data_persistence.py", line 564, in get_data
    raise FlyteAssertion(
flytekit.exceptions.user.FlyteAssertion: USER:AssertionError: error=Failed to get data from s3://flyte/flytesnacks/development/AFUU2MUU5IGWXVCGNU4USDB6WU======/script_mode.tar.gz to ./ (recursive=False).

Original exception: Unable to locate credentials
24/09/01 04:59:40 INFO ShutdownHookManager: Shutdown hook called
24/09/01 04:59:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-7412c322-30de-459f-a770-ace11f1c698e
```
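For context, the task is basically the standard Flyte hello-Spark example. Here's a minimal sketch of my `hellospark` module (the names match the `task-module hellospark` / workflow `my_spark` in the logs above; the exact `spark_conf` values here are placeholders):

```python
import random
from operator import add

import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Spark


def in_circle(_: int) -> int:
    # Sample a random point in the unit square; 1 if it falls inside the circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0


@task(
    task_config=Spark(
        # Assumed values - tune to your cluster.
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
            "spark.executor.instances": "2",
        }
    ),
)
def hello_spark(partitions: int) -> float:
    # Estimate pi by Monte Carlo on the Spark session Flyte provisions.
    n = 100000 * partitions
    sess = flytekit.current_context().spark_session
    count = (
        sess.sparkContext.parallelize(range(1, n + 1), partitions)
        .map(in_circle)
        .reduce(add)
    )
    return 4.0 * count / n


@workflow
def my_spark(partitions: int = 10) -> float:
    return hello_spark(partitions=partitions)
```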
Here's the `kubectl describe` output for the Spark driver pod:
```
Name:             asgp25xdskrs4btpqxmt-n0-0-driver
Namespace:        flytesnacks-development
Priority:         0
Service Account:  spark
Node:             tiebreaker-k8s-cluster-dev-cx32-pool-small-static-worker2/10.0.0.7
Start Time:       Sat, 31 Aug 2024 22:59:28 -0600
Labels:           domain=development
                  execution-id=asgp25xdskrs4btpqxmt
                  interruptible=false
                  node-id=n0
                  project=flytesnacks
                  shard-key=8
                  spark-app-name=asgp25xdskrs4btpqxmt-n0-0
                  spark-app-selector=spark-de6e5bfe7939467598c5f1bfabf31dcb
                  spark-role=driver
                  spark-version=3.5.0
                  sparkoperator.k8s.io/app-name=asgp25xdskrs4btpqxmt-n0-0
                  sparkoperator.k8s.io/launched-by-spark-operator=true
                  sparkoperator.k8s.io/submission-id=d2fb9401-239e-4694-9fa4-69c4aedfd222
                  task-name=hellospark-hello-spark
                  workflow-name=hellospark-my-spark
Annotations:      cluster-autoscaler.kubernetes.io/safe-to-evict: false
Status:           Failed
IP:               10.244.3.55
IPs:
  IP:  10.244.3.55
Containers:
  spark-kubernetes-driver:
    Container ID:  containerd://2bcb33865d6e879dae1b1d1af332455b1ba661688b66385e89e67f9d99f76109
    Image:         dineshpabbi10/flytekit:W0dMvUWJnzIA_aXHQrHS_A
    Image ID:      docker.io/dineshpabbi10/flytekit@sha256:497b5c3b1e582214a5af75b9b8ebadba8d86592429ec805b30367cba6219d041
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.deploy.PythonRunner
      local:///usr/local/bin/entrypoint.py
      pyflyte-fast-execute
      --additional-distribution
      s3://flyte/flytesnacks/development/AFUU2MUU5IGWXVCGNU4USDB6WU======/script_mode.tar.gz
      --dest-dir
      .
      --
      pyflyte-execute
      --inputs
      s3://flyte/metadata/propeller/flytesnacks-development-asgp25xdskrs4btpqxmt/n0/data/inputs.pb
      --output-prefix
      s3://flyte/metadata/propeller/flytesnacks-development-asgp25xdskrs4btpqxmt/n0/data/0
      --raw-output-data-prefix
      s3://flyte/data/eu/asgp25xdskrs4btpqxmt-n0-0
      --checkpoint-path
      s3://flyte/data/eu/asgp25xdskrs4btpqxmt-n0-0/_flytecheckpoints
      --prev-checkpoint
      ""
      --resolver
      flytekit.core.python_auto_container.default_task_resolver
      --
      task-module
      hellospark
      task-name
      hello_spark
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 31 Aug 2024 22:59:29 -0600
      Finished:     Sat, 31 Aug 2024 22:59:41 -0600
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1400Mi
    Requests:
      cpu:     1
      memory:  1400Mi
    Environment:
      SPARK_USER:                 root
      SPARK_APPLICATION_ID:       spark-de6e5bfe7939467598c5f1bfabf31dcb
      FLYTE_START_TIME:           1725166763554
      SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
      PYSPARK_PYTHON:             /usr/bin/python3
      PYSPARK_DRIVER_PYTHON:      /usr/bin/python3
      SPARK_LOCAL_DIRS:           /var/data/spark-8fa32a23-c365-4fea-b725-1591ff3bdbe6
      SPARK_CONF_DIR:             /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume-driver (rw)
      /var/data/spark-8fa32a23-c365-4fea-b725-1591ff3bdbe6 from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-skkt5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  spark-local-dir-1:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  spark-conf-volume-driver:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-drv-43e5be91abf3cf82-conf-map
    Optional:  false
  kube-api-access-skkt5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    10m   default-scheduler  Successfully assigned flytesnacks-development/asgp25xdskrs4btpqxmt-n0-0-driver to tiebreaker-k8s-cluster-dev-cx32-pool-small-static-worker2
  Warning  FailedMount  10m   kubelet            MountVolume.SetUp failed for volume "spark-conf-volume-driver" : configmap "spark-drv-43e5be91abf3cf82-conf-map" not found
  Normal   Pulled       10m   kubelet            Container image "dineshpabbi10/flytekit:W0dMvUWJnzIA_aXHQrHS_A" already present on machine
  Normal   Created      10m   kubelet            Created container spark-kubernetes-driver
  Normal   Started      10m   kubelet            Started container spark-kubernetes-driver
```
f
I'd recommend running Spark on a proper cluster - a single-node k3s setup might be too small for Spark.
m
Oh yeah, my k3s cluster is deployed on Hetzner Cloud with about 5 machines. I just managed to fix this issue by enabling the webhook value for the Spark operator - without the mutating webhook, the operator can't inject the pod customizations Flyte sets on the driver/executor pods (including the S3 credential env vars), which is presumably why the driver hit the NoCredentialsError above.
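For anyone else hitting this, the change was roughly the following (assuming the spark-operator Helm chart; double-check the key against your chart version's values.yaml - release and namespace names here are just my setup):

```bash
# Re-install the Spark operator with its mutating webhook enabled, so it can
# inject Flyte's pod customizations (env vars, secrets, volumes) into the
# driver/executor pods.
helm upgrade --install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --set webhook.enable=true
```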