One interesting thing about the latest wave of AI is the idea that a single model can do it all. Typically, a new development is evaluated on discrete tasks to measure its performance.

But because LLM input is so open-ended, there has been a shift in how output is evaluated. Traditional models are evaluated on tasks; LLMs are evaluated more holistically. That may not be a good thing.

When I take a math test, I’m evaluated on that subject. When I apply for a job, I’m evaluated more holistically to see if there is a fit. When I’m building relationships with people, yet another kind of evaluation takes place.

Which point on that scale is most appropriate for an LLM?