I’m having some difficulty working with files in s...
# ask-the-community
s
I’m having some difficulty working with files in S3. I have a setup where I provide a batch of S3 file paths to a workflow. The first task copies them from their original S3 location into the S3 bucket used by Flyte for processing. The next task then downloads them and does some processing on each file. However, at the moment I can’t seem to access the files. Currently this is just using the sandbox environment. Task 1 does something like:
endpoint_url = "http://flyte-sandbox-minio.flyte:9000"
s3 = boto3.client(
    's3',
    endpoint_url=endpoint_url,
    aws_access_key_id="minio",
    aws_secret_access_key="miniostorage",
    use_ssl=False,
)
for doc in inputs:
    doc["original_s3_path"] = doc["s3_path"]
    bucket, key = split_s3_path(doc["s3_path"])
    s3.copy_object(
        CopySource = f"{bucket}/{key}",
        Bucket = 'my-s3-bucket',
        Key = f"flytesnacks/development/{key}"
    )
    doc["s3_path"] = f"s3://my-s3-bucket/flytesnacks/development/{key}"

return [
    DocumentData(
        document_id=doc["document_id"],
        file=FlyteFile(path=doc["s3_path"]),
        metadata=DocumentMetaData(),
    )
    for doc in inputs
]
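The snippet above calls split_s3_path without showing it. A minimal sketch of what such a helper might look like, assuming it takes an s3:// URI and returns the bucket and key (the body here is a guess at the behaviour, not the poster's actual code, and the example path is made up):

```python
from typing import Tuple


def split_s3_path(s3_path: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    Hypothetical helper: the thread uses this name but never shows the body.
    """
    prefix = "s3://"
    if not s3_path.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {s3_path!r}")
    # Everything up to the first "/" is the bucket, the rest is the key.
    bucket, _, key = s3_path[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"missing bucket or key in: {s3_path!r}")
    return bucket, key


# Example path is illustrative only.
bucket, key = split_s3_path("s3://source-bucket/incoming/report.txt")
print(bucket, key)
```

Plain string partitioning is used here rather than a URL parser so that keys containing characters such as "?" or "#" are kept intact.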
The 2nd task then does the following:
for doc in documents:
    # download the file from s3 and read the data
    doc.file.download()
    file = open(doc.file, "r")
    text = file.read()
    # detect the language of the document and assign to the DocumentData.metadata.language_code
    doc.metadata.language_code = detector.detect(text)
    file.close()
When trying to open the file in the 2nd task, I just get a "No such file or directory" error. Any ideas?
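For reference, the builtin open only understands local filesystem paths, so handing it an s3:// URI as a plain string produces this same message on a POSIX system. Whether that is exactly what happened inside the task is an assumption; the sketch below only shows the general behaviour, and the object key in it is made up:

```python
# Hypothetical URI in the same layout as the thread; nothing is fetched from S3.
uri = "s3://my-s3-bucket/flytesnacks/development/example.txt"

error = None
try:
    # The builtin open treats the string as a relative local path ("s3:/...").
    with open(uri, "r") as handle:
        handle.read()
except FileNotFoundError as exc:
    error = exc
    print(f"builtin open failed: {exc.strerror}")
```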
j
Does your s3.copy_object operation work? And what is the DocumentData class? I would suggest inspecting the doc object before the download. You can run a local execution to test the behavior.
s
The s3.copy_object call copies the file to the my-s3-bucket/flytesnacks/development/ location, and it can be seen in the MinIO browser. DocumentData is a custom dataclass which wraps the FlyteFile and some other metadata. I have since realised that I should probably use doc.file.open() instead of the builtin open, e.g.:
with doc.file.open('r') as file:
    text = file.read()
    # detect the language of the document and assign to the DocumentData.metadata.language_code
    doc.metadata.language_code = detector.detect(text)
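The same pattern can be tried locally without a Flyte backend. This is a rough sketch, assuming only that the file object exposes an open(mode) method usable as a context manager, as doc.file.open('r') is used above. pathlib.Path stands in for FlyteFile here, and read_text is a hypothetical helper name:

```python
import tempfile
from pathlib import Path


def read_text(file_obj) -> str:
    """Read text through the object's own open() method, not the builtin open."""
    with file_obj.open("r") as handle:
        return handle.read()


# Local stand-in: pathlib.Path also exposes .open(mode), so it can play the
# role of doc.file in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("bonjour tout le monde")
    text = read_text(sample)

print(text)
```

Keeping the read behind the object's own open() means the calling code does not need to know whether the file is local or remote.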
j
I see. Yeah, Flyte should pull it down when you open it. I don't really use the explicit .download method.
s
Are you able to get this to work? @Scott Blackwood
s
I did, @Samhita Alla. In the end it was just the incorrect use of open. Thanks.