Trying to run the `mnist_classifier` example remot...
# ask-the-community
h
Trying to run the `mnist_classifier` example remotely - locally `pytorch_single_node_and_gpu.py` runs fine, but running that remotely I somehow get a permission error:
Traceback (most recent call last):

      File "/usr/local/lib/python3.9/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
        return wrapped(*args, **kwargs)
      File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 314, in pytorch_mnist_task
        training_data_loader = mnist_dataloader(hp.batch_size, train=True, **kwargs)
      File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 103, in mnist_dataloader
        datasets.MNIST(
      File "/usr/local/lib/python3.9/site-packages/torchvision/datasets/mnist.py", line 99, in __init__
        self.download()
      File "/usr/local/lib/python3.9/site-packages/torchvision/datasets/mnist.py", line 179, in download
        os.makedirs(self.raw_folder, exist_ok=True)
      File "/usr/local/lib/python3.9/os.py", line 215, in makedirs
        makedirs(head, exist_ok=exist_ok)
      File "/usr/local/lib/python3.9/os.py", line 215, in makedirs
        makedirs(head, exist_ok=exist_ok)
      File "/usr/local/lib/python3.9/os.py", line 225, in makedirs
        mkdir(name, mode)

Message:

    [Errno 13] Permission denied: './data'

User error.
If I understand correctly, that should save the outputs as it did for the local run. Or is the permission needed for the sandbox?
Not sure if it's related, but I have also used `ImageSpec` for this purpose:
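from flytekit import ImageSpec

# Container image for the task; wandb credentials are passed as environment variables.
image = ImageSpec(
    name="pytorch_mnist_classifier",
    registry="localhost:30000",
    packages=["torch", "wandb", "torchvision"],
    env={"WANDB_API_KEY": "<my-key>",
         "WANDB_USERNAME": "<my-username>"},
)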
s
@Hud, we really need to update our examples! Can you replace `./data` with `os.path.join(flytekit.current_context().working_directory, "data")`?
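A minimal sketch of that change, assuming the dataset directory is built from a working directory supplied by the caller. The helper name `dataset_root` is hypothetical and not part of the example; inside a Flyte task the argument would be `flytekit.current_context().working_directory`, and the returned path would presumably replace `./data` in the `datasets.MNIST(...)` call.

```python
import os


def dataset_root(working_directory: str, name: str = "data") -> str:
    """Return a dataset directory under `working_directory`, creating it if needed.

    Hypothetical helper: in a Flyte task, call it as
    dataset_root(flytekit.current_context().working_directory)
    and pass the result where the example currently uses "./data".
    """
    path = os.path.join(working_directory, name)
    os.makedirs(path, exist_ok=True)
    return path
```

Building the path from the task's working directory avoids relying on the process's current directory, which may not be writable in the container.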
h
@Samhita Alla argh - I should've known that change - it definitely worked, but now getting another error:
Traceback (most recent call last):

      File "/usr/local/lib/python3.9/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
        return wrapped(*args, **kwargs)
      File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 328, in pytorch_mnist_task
        accuracies.append(test(model, device, test_data_loader))
      File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 219, in test
        wandb.log({"test_loss": test_loss, "accuracy": accuracy, "mnist_predictions": my_table})
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 390, in wrapper
        return func(self, *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 341, in wrapper_fn
        return func(self, *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 331, in wrapper
        return func(self, *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1752, in log
        self._log(data=data, step=step, commit=commit)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1527, in _log
        self._partial_history_callback(data, step, commit)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1397, in _partial_history_callback
        self._backend.interface.publish_partial_history(
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 635, in publish_partial_history
        data = history_dict_to_json(run, data, step=user_step, ignore_copy_err=True)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/data_types/utils.py", line 52, in history_dict_to_json
        payload[key] = val_to_json(
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/data_types/utils.py", line 155, in val_to_json
        art = wandb.Artifact(f"run-{run.id}-{sanitized_key}", "run_table")
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifact.py", line 172, in __init__
        self._storage_policy = WandbStoragePolicy(
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/storage_policies/wandb_storage_policy.py", line 68, in __init__
        self._cache = cache or get_artifacts_cache()
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifacts_cache.py", line 173, in get_artifacts_cache
        _artifacts_cache = ArtifactsCache(cache_dir)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifacts_cache.py", line 36, in __init__
        mkdir_exists_ok(self._cache_dir)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/lib/filesystem.py", line 32, in mkdir_exists_ok
        raise PermissionError(f"{dir_name!s} is not writable") from e

Message:

    /home/flytekit/.cache/wandb/artifacts is not writable

User error.
s
Can you disable the cache?
n
looks like a wandb issue? is the wandb API key set?
h
I think so
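import os

import wandb
from flytekit import ImageSpec

# Container image for the task; wandb credentials are passed as environment variables.
image = ImageSpec(
    name="pytorch_mnist_classifier",
    registry="localhost:30000",
    packages=["torch", "wandb", "torchvision"],
    env={"WANDB_API_KEY": "<my-key>",
         "WANDB_USERNAME": "<my-username>"},
)

def wandb_setup():
    # Log in with the key from the image environment, then start a run.
    wandb.login(key=os.environ.get("WANDB_API_KEY"))
    wandb.init(
        project="mnist-single-node-single-gpu",
        entity=os.environ.get("WANDB_USERNAME"),
    )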
Disabling the cache didn't work
s
Can you set the `WANDB_CACHE_DIR` environment variable to, say, `os.path.join(flytekit.current_context().working_directory, "wandb-artifacts")`?
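A minimal sketch of this suggestion, assuming the variable is set inside the task body at run time, before `wandb` creates any artifacts, so that the path is resolved in the container where the task executes. The helper name `configure_wandb_cache` is hypothetical; inside a Flyte task the argument would be `flytekit.current_context().working_directory`.

```python
import os


def configure_wandb_cache(working_directory: str) -> str:
    """Point WANDB_CACHE_DIR at a directory under `working_directory`.

    Hypothetical helper: call it at the start of the task, before any
    wandb.log(...) call that creates artifacts (such as logging a table).
    """
    cache_dir = os.path.join(working_directory, "wandb-artifacts")
    os.makedirs(cache_dir, exist_ok=True)
    # wandb reads this environment variable to locate its cache directory.
    os.environ["WANDB_CACHE_DIR"] = cache_dir
    return cache_dir
```

Setting the variable in the task rather than in the image definition is a design choice in this sketch: an expression evaluated while the image is being defined yields a path on the machine that builds the image, which need not exist in the container.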
h
Hm, getting this error with cache turned on:
Traceback (most recent call last):

      File "/usr/local/lib/python3.9/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
        return wrapped(*args, **kwargs)
      File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 332, in pytorch_mnist_task
        accuracies.append(test(model, device, test_data_loader))
      File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 223, in test
        wandb.log({"test_loss": test_loss, "accuracy": accuracy, "mnist_predictions": my_table})
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 390, in wrapper
        return func(self, *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 341, in wrapper_fn
        return func(self, *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 331, in wrapper
        return func(self, *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1752, in log
        self._log(data=data, step=step, commit=commit)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1527, in _log
        self._partial_history_callback(data, step, commit)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1397, in _partial_history_callback
        self._backend.interface.publish_partial_history(
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 635, in publish_partial_history
        data = history_dict_to_json(run, data, step=user_step, ignore_copy_err=True)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/data_types/utils.py", line 52, in history_dict_to_json
        payload[key] = val_to_json(
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/data_types/utils.py", line 155, in val_to_json
        art = wandb.Artifact(f"run-{run.id}-{sanitized_key}", "run_table")
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifact.py", line 172, in __init__
        self._storage_policy = WandbStoragePolicy(
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/storage_policies/wandb_storage_policy.py", line 68, in __init__
        self._cache = cache or get_artifacts_cache()
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifacts_cache.py", line 173, in get_artifacts_cache
        _artifacts_cache = ArtifactsCache(cache_dir)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifacts_cache.py", line 36, in __init__
        mkdir_exists_ok(self._cache_dir)
      File "/usr/local/lib/python3.9/site-packages/wandb/sdk/lib/filesystem.py", line 32, in mkdir_exists_ok
        raise PermissionError(f"{dir_name!s} is not writable") from e

Message:

    /var/folders/ck/j4cw600j7tg2lqmdsgfz0yym0000gn/T/flyteavnroo7i/user_space/wandb-artifacts/artifacts is not writable

User error.
n
this looks like a write permissions issue. is there a way to disable it?
can you try setting `WANDB_CACHE_DIR` to `./wandb-artifacts`? I’m not sure if anything under `/var` will be write-able
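One way to check whether a candidate directory is usable before handing it to `wandb` is to try creating it and writing a temporary file there. This is a sketch using only the standard library; the helper name `is_writable_dir` is hypothetical.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if `path` can be created (if missing) and written to."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating a temporary file exercises the write permission directly;
        # the file is removed when the context manager exits.
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        # PermissionError and similar failures are subclasses of OSError.
        return False
```

Calling this inside the task with a few candidates (for example `./wandb-artifacts` or a path under the task's working directory) would show which locations the container user can actually write to.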
h
Failed with Unknown Exception <class 'Exception'> Reason: failed to run command envd build --path /var/folders/ck/j4cw600j7tg2lqmdsgfz0yym0000gn/T/flyte-bbfhb6zu/sandbox/local_flytekit/f99195e97faf77bf83e33acad2d9c89c  --platform linux/amd64 --output type=image,name=localhost:30000/pytorch_mnist_classifier:I4sJNDy0V9zduNimNFlVDA..,push=true with error b'time="2023-08-18T16:27:18+08:00" level=fatal msg="failed to build the image: failed to build: failed to wait error group: failed to solve LLB: failed to solve: Canceled: context canceled"\n'
That error was at the terminal. @Niels Bantilan, what exactly do you mean by disabling it?
s
When do you see the above error?
n
is this all using `pyflyte run` locally? how are you running this?
h
The error above appeared after setting `WANDB_CACHE_DIR` to `./wandb-artifacts`.
I ran `pyflyte run --remote mnist_classifier/pytorch_single_node_and_gpu.py pytorch_training_wf --hp '{"epochs": 10, "batch_size": 128}'`
s
Not sure what's happening. Will need to try reproducing the error.
n
okay, so it looks like this is being run locally correct?
h
sorry for the delay! @Niels Bantilan that is correct