https://flyte.org logo
Title
f

Fabio Grätz

03/02/2023, 5:29 PM
I’m thinking about writing a type transformer for pydantic’s
BaseModel
(which has several benefits over data classes which is why our ML engineers asked for this).
from pydantic import BaseModel

class Conf(BaseModel):
    ...

@task
def train(conf: Conf):
   ...
Assuming there is a
class BaseModelTransformer(TypeTransformer[BaseModel]):
, would it be invoked if the user specifies
train(conf: Conf)
or only in case they specify
train(conf: BaseModel)
?
My understanding is that this should work because this logic kicks in, is this correct?
y

Yee

03/02/2023, 6:31 PM
yeah this will work.
would you be able to upstream this fabio?
we’re actually thinking of rewriting the dataclass transformer as well, would be nice if they looked the same
the main thing is just that the underlying transformers should be called… that is if you have a pydantic model that has a
StructuredDataset
(or like another separate custom type), the same thing should happen as if it were not in a pydantic model
not familiar enough with pydantic tbh, need to play around with it some more
f

Fabio Grätz

03/02/2023, 6:39 PM
If we go forward with this, I will upstream 👍
we’re actually thinking of rewriting the dataclass transformer as well, would be nice if they looked the same
I think the pydantic base model transformer could be rather simple. For pydantic, I unfortunately need to know which exact class is trying to be deserialized.
def pydantic_encoder(obj):
    if isinstance(obj, BaseModel):
        return {'__pydantic_model__': obj.__class__.__name__, **obj.dict()}
    else:
        return obj

def pydantic_decoder(obj):
    if '__pydantic_model__' in obj:
        model_name = obj.pop('__pydantic_model__')
        model_class = globals()[model_name]
        return model_class(**obj)
    else:
        return obj

serialized = json.dumps(m, default=pydantic_encoder)

reconstructed = json.loads(serialized, object_hook=pydantic_decoder)

assert reconstructed == m
(Maybe not get if from
globals()
, first try, but save the path from which we can get it with
importlib
.) A base model has a
.schema_json()
which I would also save into the protobuf. Then, at deserialization time, we load the class using
importlib
, compare the schemas, and load using
pydantic_decoder
. I’m not 100% convinced of the part with
importlib
😕 but don’t have a better idea yet since we need to know the class when instantiating the python value again. Are you aware of a better approach?
The nice thing if we know the class is that this “integer being converted to float” during a json (de-)serialization roundtrip doesn’t happen. The dataclass_json transformer walks the dataclass and converts back to int, right?
y

Yee

03/02/2023, 6:47 PM
let me play around with this early next week and get back to you?
and thank you
for the upstream
f

Fabio Grätz

03/02/2023, 6:48 PM
let me play around with this early next week and get back to you?
Of course. If you have an idea for a better approach, I can also try it out if you give me the hint ..
b

Bernhard Stadlbauer

03/03/2023, 6:07 PM
We do have a custom de/serializer like this within Pachama, I remember it having some slight drawbacks but not exactly which. I’m currently OOO but will be back Monday to take a look
f

Fabio Grätz

03/03/2023, 7:19 PM
@Yee I gave this some more thoughts and actually, if I’m not mistaken, it can be done way simpler and cleaner than I thought yesterday. What do you think about this implementation? And about what I noted here? (It doesn’t rely on this hack with
globals()
or
importlib
anymore that yesterday I thought I would need.)
@Niels Bantilan about pydantic base model transformer
@Greg Gydush
Let’s compare 🙂
g

Greg Gydush

03/07/2023, 5:14 PM
Want to have a quick call potentially?
f

Fabio Grätz

03/07/2023, 5:36 PM
Interested in your thoughts about the implementation and whether this is interesting to upstream like this or with modifications @Niels Bantilan 🙂
g

Greg Gydush

03/07/2023, 5:37 PM
I think we'd benefit from automatic type registration, so downstream implementers of new models don't have to explicitly register (or is that handled already?)
We also had a lot of issues when we ran this at scale (if you pass model spec to 1,000s of tasks, gRPC starts to become a huge issue) - serializing to file resolves this, but just wanted to point this out!
n

Niels Bantilan

03/07/2023, 5:47 PM
let’s have a call to coordinate this effort? @Eduardo Apolinario (eapolinario) @Fabio Grätz @Greg Gydush what times work for you this week? What timezones are y’all in?
g

Greg Gydush

03/07/2023, 5:52 PM
Very flexible today, tomorrow after 2PM PT, pretty flexible Thursday after 10AM PT until 2PM
f

Fabio Grätz

03/07/2023, 5:53 PM
I’m in CET so it’s ~7pm now. Flexible later today after 8pm. Thursday evening my time/day your time would also work well.
g

Greg Gydush

03/07/2023, 6:07 PM
Would be down for later today (if not too late for you)
f

Fabio Grätz

03/07/2023, 6:10 PM
Depends on when 🙂 In an hour its 8pm here. Would work for me between ~8-10:30
g

Greg Gydush

03/07/2023, 6:10 PM
any time in that slot works for me 🙂
f

Fabio Grätz

03/07/2023, 6:15 PM
Actually I could also do from now on, not 8
g

Greg Gydush

03/07/2023, 6:36 PM
@Niels Bantilan?
n

Niels Bantilan

03/07/2023, 7:46 PM
sorry, my day today is packed with meetings… do you mind if we chat on Thursday 11am PT, 2pm ET, 8pm CET?
g

Greg Gydush

03/07/2023, 8:18 PM
Thurs 11AM works for me!
n

Niels Bantilan

03/07/2023, 8:26 PM
@Fabio Grätz @Greg Gydush mind sending me an email for a calendar invite?
g

Greg Gydush

03/07/2023, 8:29 PM
I can send one over
anyone else to add?
y

Yee

03/07/2023, 8:53 PM
me please
n

Niels Bantilan

03/07/2023, 8:55 PM
also @Eduardo Apolinario (eapolinario) probably