# flytekit
m
We had to roll back from flytekit v1.5.0 over the weekend. I'm still triaging exactly what caused this, but I want to ping here about it in case it's a larger issue. We use AWS's IAM Roles for Service Accounts (IRSA); we assign an IAM role to the service account that our flyte jobs run under. With flytekit 1.4.2 everything works without fault; however, under 1.5.0 our IAM role is not being honored. In our case it turns into an AccessDenied error from AWS when flyte attempts to PutObject into the data bucket.
y
do you have any more information @Mike Ossareh? it would help with debugging.
m
@Yee I'm working through creating a small example 👍🏼
y
thank you!
i think the issue might be related to these lines here
m
That link is taking me to the top of a commit; github does this when you point at a diff that is folded due to length. Which file are you intending to point at with this link?
y
data_persistence.py
m
yah, long diff - expanding. ty
y
always thought bitbucket was better…
but that octocat
m
sourceforge.
y
haha
m
😉
reading the test cases makes me think this should work without issue. I'm going to go bang on some pipes and see what I can make happen from a minimal case.
y
let me know how we can help… this is super important for us to get right.
m
us too 😅
r
Hey, I believe we reproduced that issue with flytekit 1.5.0 on our side. We have a Flyte client responsible for triggering launch plans and executions.
self.remote = FlyteRemote(
    config=Config.for_endpoint(
        endpoint=settings.flyte_endpoint,
        insecure=settings.flyte_insecure_connection,
        data_config=DataConfig(
            s3=S3Config(
                access_key_id=settings.flyte_data_s3_access_key,
                secret_access_key=settings.flyte_data_s3_secret_key,
            )
        ),
    ),
    default_project=settings.flyte_project,
    default_domain=settings.flyte_domain,
    data_upload_location=f"s3://{settings.flyte_data_s3_bucket_name}/data",
)
Before the bump to 1.5.0 (we were on 1.4.1), everything worked as expected. After the bump, we got this error:
FlyteAssertion('Failed to put data from /tmp/xxxx_cfe6785a-23aa-414a-a03a-e3eec11fb992_h4ibyrj_.zip to s3://bucket/data/xxxxxxxxxx/xxxx_cfe6785a-23aa-414a-a03a-e3eec11fb992_h4ibyrj_.zip (recursive=False).\n\nOriginal exception: Access Denied')
Maybe that will help with the troubleshooting?
y
is the pod still around?
if you have access to the pod yaml dump, could you look to make sure the service account is still being set correctly? better yet, if you can shell attach to the pod, check that `aws sts get-caller-identity` is still what you expect.
also by chance is the s3 bucket you’re trying to write to in a different region than where the eks cluster is?
also, have you seen this happen for the `inputs.pb` and `outputs.pb` files? or is it only ever the off-loaded data types (like FlyteFile, StructuredDataset, etc.)?
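On the service-account question above: a quick way to see which identity the pod will actually assume, without even calling STS, is to check the env vars the IRSA webhook injects. A minimal diagnostic sketch (the variable names are the standard ones EKS sets under IRSA; this is an illustrative helper, not part of flytekit):

```python
import os


def check_irsa_env() -> dict:
    """Return the IRSA-related environment variables EKS injects into a pod.

    Under IRSA the mutating webhook adds AWS_ROLE_ARN and
    AWS_WEB_IDENTITY_TOKEN_FILE to the pod spec; if either is missing,
    the AWS SDK falls back to other credential sources (e.g. the node
    instance profile), which would explain an unexpected AccessDenied.
    """
    keys = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")
    return {k: os.environ.get(k) for k in keys}


if __name__ == "__main__":
    for key, value in check_irsa_env().items():
        print(f"{key}={value}")
```

If both vars are present but `aws sts get-caller-identity` still shows the wrong role, the problem is more likely in how the client library caches or refreshes credentials than in the pod's service-account wiring.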
m
My answers to the questions above:
• definitely the correct service account
• I did not shell in and check `sts get-caller-identity`; that's a really good idea!
• the s3 bucket is in the same region
I've not gotten a test case up yet; pinning to `<1.5.0` was quick enough for us that it sort of stalled out the triaging process, but I will get to this in the next few days.
y
by chance can you pip freeze in the container as well?
just curious what is installed.
and does the docker image install aws-cli also?
m
RE: aws-cli - yes.
y
specifically, i’m interested in `pip show boto3` (if that’s installed i mean)
@Rémy Dubois if you could also send us a `pip freeze` from within the container or your virtualenv (redact anything sensitive)
r
sure @Yee - there's nothing sensitive in that list of dependencies. Here is the exhaustive list for the service that runs the code shown above:
adal==1.2.7
adlfs==2023.1.0
aiobotocore==2.4.2
aiohttp==3.8.4
aioitertools==0.11.0
aioresponses==0.7.4
aiosignal==1.3.1
alembic==1.10.3
anyio==3.6.2
arrow==1.2.3
async-timeout==4.0.2
asyncache==0.3.1
asyncpg==0.27.0
attrs==22.2.0
awscli==1.27.111
azure-core==1.26.4
azure-datalake-store==0.0.52
azure-identity==1.12.0
azure-storage-blob==12.15.0
binaryornot==0.4.4
black==23.3.0
botocore==1.29.111
cachetools==5.3.0
certifi==2022.12.7
cffi==1.15.1
cfgv==3.3.1
chardet==5.1.0
charset-normalizer==3.1.0
click==8.1.3
cloudpickle==2.2.1
colorama==0.4.4
cookiecutter==2.1.1
coverage==7.2.3
croniter==1.3.14
cryptography==40.0.1
dataclasses-json==0.5.7
decorator==5.1.1
Deprecated==1.2.13
diskcache==5.5.1
distlib==0.3.6
docker==6.0.1
docker-image-py==0.1.12
docstring-parser==0.15
docutils==0.16
execnet==1.9.0
fastapi==0.95.0
filelock==3.11.0
flyteidl==1.3.17
flytekit==1.5.0
frozendict==2.3.7
frozenlist==1.3.3
fsspec==2023.4.0
gcsfs==2023.4.0
gitdb==4.0.10
GitPython==3.1.31
google-api-core==2.11.0
google-auth==2.17.2
google-auth-oauthlib==1.0.0
google-cloud-core==2.3.2
google-cloud-storage==2.8.0
google-crc32c==1.5.0
google-resumable-media==2.4.1
googleapis-common-protos==1.59.0
graphql-core==3.2.3
greenlet==2.0.2
grpcio==1.53.0
grpcio-status==1.53.0
h11==0.14.0
httpcore==0.17.0
httptools==0.5.0
httpx==0.24.0
identify==2.5.22
idna==3.4
importlib-metadata==6.3.0
iniconfig==2.0.0
isodate==0.6.1
jaraco.classes==3.2.3
jeepney==0.8.0
Jinja2==3.1.2
jinja2-time==0.2.0
jmespath==1.0.1
joblib==1.2.0
keyring==23.13.1
kubernetes==26.1.0
Mako==1.2.4
MarkupSafe==2.1.2
marshmallow==3.19.0
marshmallow-enum==1.5.1
marshmallow-jsonschema==0.13.0
more-itertools==9.1.0
msal==1.21.0
msal-extensions==1.0.0
multidict==6.0.4
mypy-extensions==1.0.0
natsort==8.3.1
newrelic==8.8.0
newrelic-telemetry-sdk==0.4.3
nodeenv==1.7.0
numpy==1.24.2
oauthlib==3.2.2
packaging==23.1
pandas==1.5.3
pathspec==0.11.1
platformdirs==3.2.0
pluggy==1.0.0
portalocker==2.7.0
pre-commit==3.2.2
protobuf==4.22.1
protoc-gen-swagger==0.1.0
py==1.11.0
pyarrow==10.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydantic==1.10.7
PyJWT==2.6.0
pyOpenSSL==23.1.1
pytest==7.3.0
pytest-asyncio==0.21.0
pytest-cov==4.0.0
pytest-mock==3.10.0
pytest-xdist==3.2.1
python-dateutil==2.8.2
python-dotenv==1.0.0
python-json-logger==2.0.7
python-multipart==0.0.6
python-slugify==8.0.1
pytimeparse==1.1.8
pytz==2023.3
PyYAML==5.4.1
regex==2023.3.23
requests==2.28.2
requests-oauthlib==1.3.1
responses==0.23.1
retry==0.9.2
rfc3986==1.5.0
rsa==4.7.2
s3fs==0.4.2
s3transfer==0.6.0
SecretStorage==3.3.3
six==1.16.0
smmap==5.0.0
sniffio==1.3.0
sortedcontainers==2.4.0
SQLAlchemy==2.0.9
starlette==0.26.1
starlette-context==0.3.6
statsd==3.3.0
strawberry-graphql==0.171.1
tenacity==8.2.2
text-unidecode==1.3
types-PyYAML==6.0.12.9
typing-inspect==0.8.0
typing_extensions==4.5.0
urllib3==1.26.15
uvicorn==0.21.1
uvloop==0.17.0
virtualenv==20.21.0
watchfiles==0.19.0
websocket-client==1.5.1
websockets==11.0.1
wrapt==1.15.0
yarl==1.8.2
zipp==3.15.0
y
@Mike Ossareh are you able to consistently reproduce this? do you still have the 1.5 version registered?
we did some digging here and found some issues related to region caching, so that’s something to maybe try, but you did mention that everything is in one region.
and you didn’t have the flytekit-data-fsspec plugin installed right?
do you have time on monday to chat @Mike Ossareh
m
@Yee I can spin up a cluster and bump to flytekit 1.5.0 to verify this. We do use flytekitplugins-pods, but I'm not sure about flytekit-data-fsspec off hand. We don't use it explicitly, for sure.
it's not in our lock file.
@Yee I can be available on Monday, I'm more free on Tuesday though.
My Monday availability is a very slim wedge of time from 2:00pm -> 2:30pm central. #meetings
y
let’s do tuesday.
11am pacific / 1pm central?
r
FYI, I fixed my issue by reviewing my IAM role permissions. From this (which worked with aws-cli under flytekit 1.4.x):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "s3-object-lambda:*"
      ],
      "Resource": "arn:aws:s3:::bucket/*"
    }
  ]
}
To:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "s3-object-lambda:*"
      ],
      "Resource": [
        "arn:aws:s3:::bucket",
        "arn:aws:s3:::bucket/*"
      ]
    }
  ]
}
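The distinction behind that fix: object-level actions like PutObject and GetObject match `arn:aws:s3:::bucket/*`, while bucket-level actions like ListBucket match the bare bucket ARN, so a policy needs both resources once the client starts issuing bucket-level calls. A rough stdlib sketch for eyeballing whether a policy document names both (an illustrative helper, not a real IAM evaluator — it ignores resource wildcards, conditions, and Deny statements):

```python
import json


def s3_statement_covers(policy_json: str, bucket: str) -> dict:
    """Check whether any Allow statement names the bare bucket ARN
    (needed for bucket-level calls like ListBucket) and the bucket/*
    ARN (needed for object-level calls like PutObject)."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    object_arn = f"arn:aws:s3:::{bucket}/*"
    found = {"bucket": False, "objects": False}
    for stmt in json.loads(policy_json).get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        # "Resource" may be a single string or a list of ARNs.
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        found["bucket"] = found["bucket"] or bucket_arn in resources
        found["objects"] = found["objects"] or object_arn in resources
    return found
```

Run against the first policy above, this reports the bucket-level grant missing and the object-level grant present, which is exactly the gap the second policy closes.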
m
relevant part of our policy;
{
    "Action": [
        "s3:*"
    ],
    "Effect": "Allow",
    "Resource": [
        "arn:aws:s3:::bucket",
        "arn:aws:s3:::bucket/*"
    ]
},
(I've swapped out the actual bucket name due to paranoia)
I've just re-run 1.5.0 on both a demonstration cluster (getting prepared for the chat with @Yee) and our production cluster - and the problem has seemingly vanished. Here is the data from today's run: https://gist.github.com/ossareh/df38f82306e45c852a2235021f51dca7
@Yee unless this occurs again, I honestly don't know if there's value in us meeting. I was on a plane when it last occurred so couldn't reliably triage it. If it occurs again I'm in a better spot to triage and debug it, and can provide the information you've requested.
y
got it
thank you. definitely let us know if it comes up again.
m
will do
So, annoyingly, the problem reared its head again; a different S3 verb this time, but still S3. Also annoyingly, I was again not able to triage when it happened (it occurred over a weekend), so I wasn't able to get details of the issue. The fix our engineer opted for was to roll back to `flytekit` and `flytekitplugins-pods` `<1.5.0`. I'm going to roll forward to 1.5.0 again, assuming we'll hit the issue in the next couple of days while I'm around to triage.
y
thank you.
b
did you encounter the problem again @Mike Ossareh? this issue is suddenly cropping up in our dev cluster. no changes to iam role
i can confirm we had upgraded the flyte image used by our training workloads in dev to `flytekit==1.5.0`. the issue was “fixed” by rolling back to the previous image (running `1.4.2`)
m
Funny you ask; we literally got bitten by it again just last night.
We figured there might be something up with running flyte 1.3.0 on the backend against flytekit 1.5.0, so we rolled our backend over to flyte 1.5.0. However, we're now seeing spurious failures. It's my focus today to work out what's going on. I'll update here once I work out what's up.
Part of the issue is we don’t have a reliable reproduction case; it seems time based, like some AWS credential is expiring.
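One way to test the expiring-credential theory from inside a pod: the IRSA web identity token is a JWT whose payload carries an `exp` claim, so the stdlib can decode it and compare it against the clock. A diagnostic sketch (it does not verify the signature; in a pod the token path would come from the `AWS_WEB_IDENTITY_TOKEN_FILE` env var that IRSA sets):

```python
import base64
import json
import time


def jwt_exp(token: str) -> int:
    """Extract the exp claim from a JWT without verifying the signature.

    A JWT is three base64url segments joined by dots; the middle one is
    the JSON payload. Padding is re-added because base64url strips it.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(payload["exp"])


def seconds_until_expiry(token: str, now=None) -> int:
    """Seconds until the token expires (negative if already expired)."""
    now = time.time() if now is None else now
    return int(jwt_exp(token) - now)
```

If a long-lived client keeps using credentials derived from a token past its `exp`, time-correlated AccessDenied failures like the weekend ones would fit that pattern.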