# flyte-support
l
Hi, community. Does anyone have experience moving inference logic from AWS Sagemaker into Flyte?
c
@loud-dentist-4938, do you mean converting code from Sagemaker Studio into Flyte entities? Make sure to look into the sagemaker agent: https://docs.flyte.org/en/latest/flytesnacks/examples/sagemaker_inference_agent/index.html
l
I'm not seeing any advantage in using the sagemaker_inference_agent vs just calling the endpoint APIs via boto3:
```python
# Import path per the flytekitplugins-awssagemaker_inference plugin.
from flytekitplugins.awssagemaker_inference import SageMakerInvokeEndpointTask

invoke_endpoint = SageMakerInvokeEndpointTask(
    name="sagemaker_invoke_endpoint",
    config={
        "EndpointName": "YOUR_ENDPOINT_NAME_HERE",
        "InputLocation": "s3://sagemaker-agent-xgboost/inference_input",
    },
    region=REGION,
)
```
and the equivalent direct call with boto3:
```python
import json
import boto3

def invoke(endpoint_name: str, payload: dict, region: str) -> str:
    runtime_client = boto3.client("sagemaker-runtime", region_name=region)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return response["Body"].read().decode("utf-8")
```
I was asking about actually moving the inference logic itself (inference.py in SageMaker terms) into a Flyte task that gets executed on a custom image with the models unpacked and carries the whole inference burden (finding the tensor of cosine similarities, performing a k-nearest-neighbors search, ...)
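Roughly what I'm picturing (just a sketch; the container image, model path, and use of scikit-learn's NearestNeighbors are placeholders):
```python
import joblib
import numpy as np
import pandas as pd
from flytekit import task
from sklearn.neighbors import NearestNeighbors

# Hypothetical location of the unpacked model artifacts inside the image.
MODEL_PATH = "/opt/model/reference_embeddings.joblib"

@task(container_image="ghcr.io/your-org/inference:latest")  # placeholder image
def infer(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    # Reference embeddings shipped with (or unpacked onto) the custom image.
    reference = joblib.load(MODEL_PATH)

    # metric="cosine" reports 1 - cosine similarity as the distance, so this
    # covers both the cosine-similarity tensor and the kNN search in one step.
    knn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(reference)
    distances, indices = knn.kneighbors(df.to_numpy(dtype=np.float32))

    return pd.DataFrame({"neighbor_ids": list(indices), "distance": list(distances)})
```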
t
@loud-dentist-4938 are you referring to offline batch inference? or real-time serving?
f
@loud-dentist-4938 we can get you on Union; Union has real-time serving (inference endpoints), and we have reusable containers that make it possible to run near-real-time use cases (milliseconds)
Would love to have a chat and happy to get you folks on; also, if you move quickly, we'll throw in a new-year discount 😍
l
@thankful-minister-83577, I was thinking about real-time. My use case is relatively simple: we're talking about inference on, say, a ~1000 KB dataframe of about 1000 rows. A Flyte workflow gets triggered once data from a source reaches a certain state, and a task in that workflow calls the SageMaker endpoint. Most of the time the EC2 instance serving the /invocations requests sits underutilized and just keeps eating $. Technically we could bring the same instance type with our model into our Flyte cluster, have a task run on that instance, and do all the inference logic inside the task. So I was wondering if someone has a good example or can share some insights on how that Flyte-local real-time inference can be done
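Something like this is what I have in mind (a sketch only; the resource figures and S3 path are made up, and pinning to the specific instance type would be a node-selector / pod-template detail on top):
```python
from flytekit import Resources, task, workflow

# Size the task so it lands on the instance type that used to back the
# SageMaker endpoint; the figures below are made up.
@task(requests=Resources(cpu="4", mem="8Gi"), limits=Resources(cpu="4", mem="8Gi"))
def run_inference(data_uri: str) -> str:
    # ... unpack models from the image and score the ~1000-row dataframe ...
    return "s3://your-bucket/results"  # placeholder output location

# Triggered (e.g. via a launch plan or external event) once the source data is ready.
@workflow
def on_data_ready(data_uri: str) -> str:
    return run_inference(data_uri=data_uri)
```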
f
Real-time serving isn't something Flyte is really designed for, but if you're already on Kubernetes you can quite easily stand up something like Ray Serve to provide an endpoint you can just send a POST request to
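For example (a minimal sketch; the deployment class and the model loading are placeholders):
```python
# Minimal Ray Serve sketch: stand up an HTTP endpoint on the cluster that
# accepts POSTed JSON and returns predictions. Model loading is a placeholder.
from ray import serve
from starlette.requests import Request

@serve.deployment
class Predictor:
    def __init__(self):
        # Load your model once per replica (placeholder).
        self.model = None

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # Replace with real scoring logic.
        return {"prediction": payload}

# Exposes the deployment over HTTP (port 8000 by default).
serve.run(Predictor.bind())
```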
f
@loud-dentist-4938 it seems you are talking about batch or async inference and not wanting to pay for an idle instance. If a strict interactive SLA is not a concern, you can build a workflow and make the model server a sidecar (pod templates; see the sketch below), check out the NIM / Ollama examples, or even just call the Python inference code in a task as a library. But if you want a tight SLA, Union does offer scale-from-zero. And if you want fast workflows, Union has an extension for Flyte called actor tasks
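The sidecar route would look roughly like this (a sketch; the container names, image, and port are illustrative):
```python
# Sketch: run the model server as a sidecar of the task pod via a pod template,
# so the server only lives (and costs money) while the task is running.
import requests
from flytekit import PodTemplate, task
from kubernetes.client import V1Container, V1ContainerPort, V1PodSpec

server_sidecar = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[
            # "primary" is merged with the task's own container by Flyte.
            V1Container(name="primary"),
            V1Container(
                name="model-server",
                image="ghcr.io/your-org/model-server:latest",  # placeholder image
                ports=[V1ContainerPort(container_port=8080)],
            ),
        ]
    ),
    primary_container_name="primary",
)

@task(pod_template=server_sidecar)
def predict(payload: dict) -> dict:
    # The sidecar shares the pod network namespace, so it's reachable on
    # localhost; the /invocations path is illustrative.
    resp = requests.post("http://localhost:8080/invocations", json=payload)
    return resp.json()
```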