# ask-the-community
f
I’m thinking about writing a type transformer for pydantic’s `BaseModel` (which has several benefits over data classes, which is why our ML engineers asked for this).
```python
from pydantic import BaseModel

class Conf(BaseModel):
    ...

@task
def train(conf: Conf):
    ...
```
Assuming there is a `class BaseModelTransformer(TypeTransformer[BaseModel])`, would it be invoked if the user specifies `train(conf: Conf)`, or only in case they specify `train(conf: BaseModel)`?
My understanding is that this should work because this logic kicks in, is this correct?
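For anyone following along, the lookup that makes this work can be sketched generically: an engine that keeps a registry keyed by Python type and walks a class’s MRO will find a transformer registered for the base class. This is a hypothetical, self-contained sketch of that idea, not flytekit’s actual `TypeEngine` code:

```python
# Hypothetical sketch of MRO-based transformer lookup (not flytekit's real code).
TRANSFORMERS = {}

def register(python_type, transformer):
    TRANSFORMERS[python_type] = transformer

def get_transformer(python_type):
    # Walk the method resolution order so a transformer registered for a
    # base class is found for all of its subclasses.
    for base in python_type.__mro__:
        if base in TRANSFORMERS:
            return TRANSFORMERS[base]
    raise ValueError(f"no transformer registered for {python_type}")

class BaseModel:  # stand-in for pydantic.BaseModel
    pass

class Conf(BaseModel):
    pass

register(BaseModel, "BaseModelTransformer")
assert get_transformer(Conf) == "BaseModelTransformer"
```

Under this lookup scheme, registering once for `BaseModel` covers `Conf` and every other subclass.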
y
yeah this will work.
would you be able to upstream this fabio?
we’re actually thinking of rewriting the dataclass transformer as well, would be nice if they looked the same
the main thing is just that the underlying transformers should be called… that is, if you have a pydantic model that has a `StructuredDataset` (or another separate custom type), the same thing should happen as if it were not in a pydantic model
not familiar enough with pydantic tbh, need to play around with it some more
f
If we go forward with this, I will upstream 👍
> we’re actually thinking of rewriting the dataclass transformer as well, would be nice if they looked the same
I think the pydantic base model transformer could be rather simple. For pydantic, I unfortunately need to know which exact class is being deserialized.
```python
import json
from pydantic import BaseModel

def pydantic_encoder(obj):
    # json.dumps `default` hook: tag BaseModel instances with their class name
    if isinstance(obj, BaseModel):
        return {'__pydantic_model__': obj.__class__.__name__, **obj.dict()}
    else:
        return obj

def pydantic_decoder(obj):
    # json.loads `object_hook`: look up the tagged class and rebuild the model
    if '__pydantic_model__' in obj:
        model_name = obj.pop('__pydantic_model__')
        model_class = globals()[model_name]
        return model_class(**obj)
    else:
        return obj

# m is some BaseModel instance
serialized = json.dumps(m, default=pydantic_encoder)
reconstructed = json.loads(serialized, object_hook=pydantic_decoder)
assert reconstructed == m
```
(Maybe not get it from `globals()` on a first try, but save the path from which we can get it with `importlib`.) A base model has a `.schema_json()` which I would also save into the protobuf. Then, at deserialization time, we load the class using `importlib`, compare the schemas, and load using `pydantic_decoder`. I’m not 100% convinced of the part with `importlib` 😕 but I don’t have a better idea yet since we need to know the class when instantiating the python value again. Are you aware of a better approach?
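For reference, the `importlib` part being discussed could look roughly like this; `locate_class` is a hypothetical helper name, and the assumption is that the module path and qualified name of the class are persisted next to the serialized payload:

```python
import importlib
from collections import OrderedDict

def locate_class(module_path, qualname):
    # Hypothetical helper: rebuild a class from its module path and qualified
    # name, e.g. values saved alongside the payload at serialization time.
    module = importlib.import_module(module_path)
    obj = module
    for part in qualname.split("."):  # handle nested classes like "Outer.Inner"
        obj = getattr(obj, part)
    return obj

# Round-trips a (module, qualname) pair back to the class object itself.
assert locate_class("collections", "OrderedDict") is OrderedDict
```

The obvious caveat, as noted above, is that the model’s module must be importable in the deserializing process.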
The nice thing, if we know the class, is that this “integer being converted to float” during a JSON (de-)serialization round trip doesn’t happen. The dataclass_json transformer walks the dataclass and converts back to int, right?
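A toy illustration of that round trip (all names here are made up for the example): a protobuf `Struct` stores every number as a double, so a decoder that knows the declared field types has to walk them to restore ints:

```python
from dataclasses import dataclass, fields

@dataclass
class Conf:  # hypothetical config model for the example
    epochs: int
    lr: float

def simulate_struct_roundtrip(d):
    # A protobuf Struct has a single number kind: double. Simulate that.
    return {k: float(v) if isinstance(v, (int, float)) else v
            for k, v in d.items()}

def rebuild(cls, d):
    # Walk the declared field types (as dataclass_json effectively does)
    # and coerce doubles back to ints where the annotation says int.
    coerced = {
        f.name: int(d[f.name]) if f.type is int and isinstance(d[f.name], float)
        else d[f.name]
        for f in fields(cls)
    }
    return cls(**coerced)

roundtripped = rebuild(Conf, simulate_struct_roundtrip({"epochs": 3, "lr": 0.1}))
assert roundtripped == Conf(epochs=3, lr=0.1)
assert isinstance(roundtripped.epochs, int)  # not 3.0
```

With pydantic the same effect falls out of validation, since the model class knows each field’s type.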
y
let me play around with this early next week and get back to you?
and thank you
for the upstream
f
> let me play around with this early next week and get back to you?
Of course. If you have an idea for a better approach, I can also try it out if you give me the hint.
b
We do have a custom de/serializer like this within Pachama; I remember it having some slight drawbacks, but not exactly which ones. I’m currently OOO but will be back Monday to take a look
f
@Yee I gave this some more thought and actually, if I’m not mistaken, it can be done way simpler and cleaner than I thought yesterday. What do you think about this implementation? And about what I noted here? (It doesn’t rely on this hack with `globals()` or `importlib` anymore that I thought I would need yesterday.)
@Niels Bantilan about pydantic base model transformer
@Greg Gydush
Let’s compare 🙂
g
Want to have a quick call potentially?
f
Interested in your thoughts about the implementation and whether this is interesting to upstream like this or with modifications @Niels Bantilan 🙂
g
I think we'd benefit from automatic type registration, so downstream implementers of new models don't have to explicitly register (or is that handled already?)
We also had a lot of issues when we ran this at scale (if you pass a model spec to 1,000s of tasks, gRPC starts to become a huge issue); serializing to a file resolves this, but just wanted to point this out!
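One way the automatic registration mentioned above could be sketched (a hypothetical mixin, not an existing flytekit or pydantic API) is with `__init_subclass__`, so that merely defining a model registers it:

```python
# Hypothetical auto-registration mixin: every subclass registers itself on
# definition, so downstream implementers never call register() explicitly.
REGISTRY = {}

class AutoRegisteredModel:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls

class Conf(AutoRegisteredModel):  # example downstream model
    pass

# Defining the class was enough; no explicit registration call needed.
assert REGISTRY["Conf"] is Conf
```

In practice the hook would register the class with the type engine rather than a plain dict, but the mechanism is the same.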
n
let’s have a call to coordinate this effort? @Eduardo Apolinario (eapolinario) @Fabio Grätz @Greg Gydush what times work for you this week? What timezones are y’all in?
g
Very flexible today, tomorrow after 2PM PT, pretty flexible Thursday after 10AM PT until 2PM
f
I’m in CET so it’s ~7pm now. Flexible later today after 8pm. Thursday evening my time/day your time would also work well.
g
Would be down for later today (if not too late for you)
f
Depends on when 🙂 In an hour it’s 8pm here. Would work for me between ~8-10:30
g
any time in that slot works for me 🙂
f
Actually I could also do it from now, rather than at 8
g
@Niels Bantilan?
n
sorry, my day today is packed with meetings… do you mind if we chat on Thursday 11am PT, 2pm ET, 8pm CET?
g
Thurs 11AM works for me!
n
@Fabio Grätz @Greg Gydush mind sending me an email for a calendar invite?
g
I can send one over
anyone else to add?
y
me please
n
also @Eduardo Apolinario (eapolinario) probably