It’s quite tricky. There are many evaluation methods out there that each measure something specific, but by themselves they don’t always tell the full story.
It’s also possible to finetune an LLM to target a specific evaluation method, which inflates its score on that benchmark without improving its general capability.
Then there are leaderboards like the LMSYS Chatbot Arena, where users send a prompt to two anonymous models and vote for the one with the better response.
The way I’m choosing to evaluate is to drop Llama in place of other LLMs in an existing set of problems I’ve already worked on or implemented, and compare the responses directly.
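A minimal sketch of what that drop-in swap can look like, assuming the existing code already talks to an OpenAI-compatible chat API and that Llama is served locally behind one as well (e.g. via Ollama at `http://localhost:11434/v1`); the model names and prompt here are just placeholders:

```python
from openai import OpenAI

# Point the same client interface at different backends.
# Keys and endpoints are assumptions; adjust for your setup.
BACKENDS = {
    "gpt-4o": OpenAI(),  # reads OPENAI_API_KEY from the environment
    "llama3": OpenAI(base_url="http://localhost:11434/v1", api_key="unused"),
}

def ask(backend: str, prompt: str) -> str:
    """Send the same prompt through whichever model backend is chosen."""
    client = BACKENDS[backend]
    resp = client.chat.completions.create(
        model=backend,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Run one of your existing task prompts against both models
# and compare the outputs by hand.
prompt = "Summarize the tradeoffs of caching results in this service."
for name in BACKENDS:
    print(f"--- {name} ---")
    print(ask(name, prompt))
```

Because both models sit behind the same interface, the rest of the existing code doesn’t change; the comparison stays grounded in problems whose expected answers you already know.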