Hud
08/13/2023, 5:21 AM
I'm trying to run the mnist_classifier example remotely. Locally, pytorch_single_node_and_gpu.py runs fine, but running it remotely I somehow get a permission error:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
return wrapped(*args, **kwargs)
File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 314, in pytorch_mnist_task
training_data_loader = mnist_dataloader(hp.batch_size, train=True, **kwargs)
File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 103, in mnist_dataloader
datasets.MNIST(
File "/usr/local/lib/python3.9/site-packages/torchvision/datasets/mnist.py", line 99, in __init__
self.download()
File "/usr/local/lib/python3.9/site-packages/torchvision/datasets/mnist.py", line 179, in download
os.makedirs(self.raw_folder, exist_ok=True)
File "/usr/local/lib/python3.9/os.py", line 215, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/local/lib/python3.9/os.py", line 215, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/usr/local/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
Message:
[Errno 13] Permission denied: './data'
User error.
If I understand correctly, that should save the outputs as it did for the local run, or is the permission needed for the sandbox? I'm using this ImageSpec for this purpose:
image = ImageSpec(
    name="pytorch_mnist_classifier",
    registry="localhost:30000",
    packages=["torch", "wandb", "torchvision"],
    env={"WANDB_API_KEY": "<my-key>",
         "WANDB_USERNAME": "<my-username>"},
)
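The traceback above ends in os.makedirs failing on the relative path ./data, so the container's working directory is apparently not writable for the user running the task. A minimal sketch, using only the standard library, of choosing a writable dataset root before handing it to datasets.MNIST. The helper name resolve_data_root and the fallback to a temporary directory are assumptions for illustration and are not part of the example code:

```python
import os
import tempfile


def resolve_data_root(preferred: str = "./data") -> str:
    """Return a directory that exists and is writable.

    Tries `preferred` first; if it cannot be created or written to,
    falls back to a fresh temporary directory (a hypothetical policy).
    """
    try:
        os.makedirs(preferred, exist_ok=True)
        if os.access(preferred, os.W_OK):
            return os.path.abspath(preferred)
    except OSError:
        # PermissionError (Errno 13) is a subclass of OSError.
        pass
    return tempfile.mkdtemp(prefix="mnist-data-")
```

The returned path would then be passed as the dataset root, for example datasets.MNIST(resolve_data_root(), train=True, download=True).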
Samhita Alla
Could you replace ./data with os.path.join(flytekit.current_context().working_directory, "data")?
Hud
08/14/2023, 11:11 AM
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
return wrapped(*args, **kwargs)
File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 328, in pytorch_mnist_task
accuracies.append(test(model, device, test_data_loader))
File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 219, in test
wandb.log({"test_loss": test_loss, "accuracy": accuracy, "mnist_predictions": my_table})
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 390, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 341, in wrapper_fn
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 331, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1752, in log
self._log(data=data, step=step, commit=commit)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1527, in _log
self._partial_history_callback(data, step, commit)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1397, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 635, in publish_partial_history
data = history_dict_to_json(run, data, step=user_step, ignore_copy_err=True)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/data_types/utils.py", line 52, in history_dict_to_json
payload[key] = val_to_json(
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/data_types/utils.py", line 155, in val_to_json
art = wandb.Artifact(f"run-{run.id}-{sanitized_key}", "run_table")
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifact.py", line 172, in __init__
self._storage_policy = WandbStoragePolicy(
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/storage_policies/wandb_storage_policy.py", line 68, in __init__
self._cache = cache or get_artifacts_cache()
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifacts_cache.py", line 173, in get_artifacts_cache
_artifacts_cache = ArtifactsCache(cache_dir)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifacts_cache.py", line 36, in __init__
mkdir_exists_ok(self._cache_dir)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/lib/filesystem.py", line 32, in mkdir_exists_ok
raise PermissionError(f"{dir_name!s} is not writable") from e
Message:
/home/flytekit/.cache/wandb/artifacts is not writable
User error.
Samhita Alla
Niels Bantilan
08/14/2023, 2:09 PM
Hud
08/14/2023, 11:03 PM
image = ImageSpec(
    name="pytorch_mnist_classifier",
    registry="localhost:30000",
    packages=["torch", "wandb", "torchvision"],
    env={"WANDB_API_KEY": "<my-key>",
         "WANDB_USERNAME": "<my-username>"},
)

def wandb_setup():
    wandb.login(key=os.environ.get("WANDB_API_KEY"))
    wandb.init(
        project="mnist-single-node-single-gpu",
        entity=os.environ.get("WANDB_USERNAME"),
    )
Disabling the cache didn't work.
Samhita Alla
Could you set the WANDB_CACHE_DIR environment variable to, say, os.path.join(flytekit.current_context().working_directory, "wandb-artifacts")?
Hud
08/17/2023, 1:43 AM
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/flytekit/exceptions/scopes.py", line 206, in user_entry_point
return wrapped(*args, **kwargs)
File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 332, in pytorch_mnist_task
accuracies.append(test(model, device, test_data_loader))
File "/root/mnist_classifier/pytorch_single_node_and_gpu.py", line 223, in test
wandb.log({"test_loss": test_loss, "accuracy": accuracy, "mnist_predictions": my_table})
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 390, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 341, in wrapper_fn
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 331, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1752, in log
self._log(data=data, step=step, commit=commit)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1527, in _log
self._partial_history_callback(data, step, commit)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/wandb_run.py", line 1397, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/interface/interface.py", line 635, in publish_partial_history
data = history_dict_to_json(run, data, step=user_step, ignore_copy_err=True)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/data_types/utils.py", line 52, in history_dict_to_json
payload[key] = val_to_json(
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/data_types/utils.py", line 155, in val_to_json
art = wandb.Artifact(f"run-{run.id}-{sanitized_key}", "run_table")
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifact.py", line 172, in __init__
self._storage_policy = WandbStoragePolicy(
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/storage_policies/wandb_storage_policy.py", line 68, in __init__
self._cache = cache or get_artifacts_cache()
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifacts_cache.py", line 173, in get_artifacts_cache
_artifacts_cache = ArtifactsCache(cache_dir)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/artifacts/artifacts_cache.py", line 36, in __init__
mkdir_exists_ok(self._cache_dir)
File "/usr/local/lib/python3.9/site-packages/wandb/sdk/lib/filesystem.py", line 32, in mkdir_exists_ok
raise PermissionError(f"{dir_name!s} is not writable") from e
Message:
/var/folders/ck/j4cw600j7tg2lqmdsgfz0yym0000gn/T/flyteavnroo7i/user_space/wandb-artifacts/artifacts is not writable
User error.
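The path in this message sits under /var/folders, which looks like a temporary directory on the machine that submitted the workflow rather than one inside the task container. That would happen if flytekit.current_context().working_directory were evaluated where the ImageSpec env is defined. A minimal sketch, assuming the variable can instead be set inside the task at run time and that wandb reads it when it first creates its artifact cache. The helper name point_wandb_cache_at and its arguments are hypothetical:

```python
import os


def point_wandb_cache_at(base_dir: str, subdir: str = "wandb-artifacts") -> str:
    """Create base_dir/subdir and export it as WANDB_CACHE_DIR.

    Meant to be called inside the running task, so the path is
    resolved in the container and not on the submitting machine.
    """
    cache_dir = os.path.join(base_dir, subdir)
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["WANDB_CACHE_DIR"] = cache_dir
    return cache_dir
```

Inside the task this might be called as point_wandb_cache_at(flytekit.current_context().working_directory) before wandb_setup().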
Niels Bantilan
08/17/2023, 2:13 PM
Can you try setting WANDB_CACHE_DIR to ./wandb-artifacts? I’m not sure if anything under /var will be writable.
Hud
08/18/2023, 8:29 AM
Failed with Unknown Exception <class 'Exception'> Reason: failed to run command envd build --path /var/folders/ck/j4cw600j7tg2lqmdsgfz0yym0000gn/T/flyte-bbfhb6zu/sandbox/local_flytekit/f99195e97faf77bf83e33acad2d9c89c --platform linux/amd64 --output type=image,name=localhost:30000/pytorch_mnist_classifier:I4sJNDy0V9zduNimNFlVDA..,push=true with error b'time="2023-08-18T16:27:18+08:00" level=fatal msg="failed to build the image: failed to build: failed to wait error group: failed to solve LLB: failed to solve: Canceled: context canceled"\n'
failed to run command envd build --path /var/folders/ck/j4cw600j7tg2lqmdsgfz0yym0000gn/T/flyte-bbfhb6zu/sandbox/local_flytekit/f99195e97faf77bf83e33acad2d9c89c --platform linux/amd64 --output type=image,name=localhost:30000/pytorch_mnist_classifier:I4sJNDy0V9zduNimNFlVDA..,push=true with error b'time="2023-08-18T16:27:18+08:00" level=fatal msg="failed to build the image: failed to build: failed to wait error group: failed to solve LLB: failed to solve: Canceled: context canceled"\n'
That error was at the terminal. @Niels Bantilan, how do you mean, disable what exactly?
Samhita Alla
Niels Bantilan
08/18/2023, 2:31 PM
Are you running pyflyte run locally? How are you running this?
Hud
08/18/2023, 11:09 PM
I set WANDB_CACHE_DIR to ./wandb-artifacts.
I ran pyflyte run --remote mnist_classifier/pytorch_single_node_and_gpu.py pytorch_training_wf --hp '{"epochs": 10, "batch_size": 128}'
Samhita Alla
Niels Bantilan
08/21/2023, 3:24 PM
Hud
08/25/2023, 11:06 PM