It’s quite tricky. There are many evaluation methods out there that each measure something specific, but by themselves they don’t always tell the full story.
It’s also possible to finetune an LLM to target a specific evaluation method, which inflates its score on that benchmark without improving its general capability.
Then there are leaderboards like the LMSYS Chatbot Arena, where users send a prompt to two anonymous models and vote for the one with the better response.
The way I’m choosing to evaluate is to drop Llama in place of other LLMs in an existing set of problems I’ve already worked on or implemented, and compare the responses directly.
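A minimal sketch of what that drop-in swap can look like, assuming the existing code already talks to an OpenAI-compatible chat API and that Llama is served locally behind one as well (e.g. via Ollama at `http://localhost:11434/v1`); the model names and prompt here are just placeholders:

```python
from openai import OpenAI

# Point the same client interface at different backends.
# Keys and endpoints are assumptions; adjust for your setup.
BACKENDS = {
    "gpt-4o": OpenAI(),  # reads OPENAI_API_KEY from the environment
    "llama3": OpenAI(base_url="http://localhost:11434/v1", api_key="unused"),
}

def ask(backend: str, prompt: str) -> str:
    """Send the same prompt through whichever model backend is chosen."""
    client = BACKENDS[backend]
    resp = client.chat.completions.create(
        model=backend,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Run one of your existing task prompts against both models
# and compare the outputs by hand.
prompt = "Summarize the tradeoffs of caching results in this service."
for name in BACKENDS:
    print(f"--- {name} ---")
    print(ask(name, prompt))
```

Because both models sit behind the same interface, the rest of the existing code doesn’t change; the comparison stays grounded in problems whose expected answers you already know.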